Made by Lambda Legends

Work of Legends

This is a data engineering project that implements an end-to-end ETL (extract, transform, load) pipeline. It extracts data from a database, transforms it into a star schema and finally loads it into an AWS warehouse.

Current features:

Data Extraction: Uses a Python application to automatically ingest data from the totesys operational database into an S3 bucket in AWS.

Data Transformation: Uses a Python application to process raw data so that it conforms to a star schema for the data warehouse. The transformed data is stored in parquet format in a second S3 bucket.

Data Loading: Loads transformed data into an AWS-hosted data warehouse, populating dimension and fact tables.

Automation: The end-to-end pipeline is triggered by completion of a data job.

Monitoring and Alerts: Logs to CloudWatch and sends SNS email alerts in case of failures.
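As a rough illustration of the transformation step, the sketch below splits flat, denormalised rows into one dimension table and one fact table. It uses plain Python lists and dicts as a stand-in for pandas so that it runs without extra packages. The table names (dim_staff, fact_sales) and column names are hypothetical and are not taken from the project.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def to_star_schema(raw_rows: List[Row]) -> Tuple[List[Row], List[Row]]:
    """Split denormalised sales rows into a staff dimension and a sales fact table."""
    dim_staff: Dict[object, Row] = {}
    fact_sales: List[Row] = []
    for row in raw_rows:
        staff_id = row["staff_id"]
        # Each staff member appears once in the dimension,
        # however many sales they made.
        dim_staff.setdefault(
            staff_id,
            {"staff_id": staff_id, "staff_name": row["staff_name"]},
        )
        # The fact row keeps the measure plus a foreign key to the dimension;
        # descriptive attributes such as the name are left out.
        fact_sales.append(
            {
                "sales_order_id": row["sales_order_id"],
                "staff_id": staff_id,
                "units_sold": row["units_sold"],
            }
        )
    return list(dim_staff.values()), fact_sales


# Hypothetical sample data for illustration only.
raw = [
    {"sales_order_id": 1, "staff_id": 10, "staff_name": "Ada", "units_sold": 5},
    {"sales_order_id": 2, "staff_id": 10, "staff_name": "Ada", "units_sold": 3},
    {"sales_order_id": 3, "staff_id": 11, "staff_name": "Grace", "units_sold": 7},
]
dim_staff, fact_sales = to_star_schema(raw)
```

With pandas, the same idea would typically be expressed by selecting the dimension columns and dropping duplicates, then selecting the key and measure columns for the fact table.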

The Team

Pratik Shrestha

Recently took an independent learning gap, currently a data engineer based in London, UK

GitHub LinkedIn
Rrezon Mripa

Currently a junior data engineer based in London

GitHub LinkedIn
Joshua Man

Recent university graduate, currently a data engineer based in London, UK

GitHub LinkedIn
Mirriam Karimi

Former entrepreneur, currently a data engineer based in Burton-on-Trent, UK

GitHub LinkedIn
Eloise Holland

Recent university graduate, currently a data engineer based in London, UK

GitHub LinkedIn

Tech Stack

Tech Stack for this group

We used pg8000, pandas, boto3, AWS Wrangler, pytest, moto, Terraform, Git and GitHub Actions.

pg8000: Connecting to and querying the PostgreSQL database.

pandas: Manipulating and transforming data into tables.

boto3: Interacting with AWS services.

AWS Wrangler: Simplifying the process of writing transformed dataframes back to S3 in parquet format during the Transform phase.

pytest: Testing.

moto: Mocking AWS services during testing.

Terraform: Defining and provisioning the AWS infrastructure.

Git: Version control for tracking changes in our project code.

GitHub Actions: Automated testing and deployment workflows to ensure code quality and streamline the CI/CD pipeline.

Challenges Faced

We faced challenges during data extraction because we wanted to avoid saving the data on our local machines. We also faced challenges with Terraform changes not being automatically reflected in our Lambda functions.
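One common way to keep extracted data off the local disk is to serialise it into an in-memory buffer and hand the bytes straight to S3. The sketch below shows that idea under stated assumptions. It is not necessarily how the team solved the problem. The function name upload_rows_as_csv and the CSV format are hypothetical choices, and the s3_client argument is assumed to be a boto3 S3 client (or any object with a compatible put_object method).

```python
import csv
import io
from typing import Any, Iterable, Sequence


def upload_rows_as_csv(
    s3_client: Any,
    bucket: str,
    key: str,
    header: Sequence[str],
    rows: Iterable[Sequence[Any]],
) -> int:
    """Write rows to an in-memory CSV and upload it; return the byte count."""
    # StringIO lives entirely in memory, so nothing touches the filesystem.
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    body = buffer.getvalue().encode("utf-8")
    # put_object accepts the payload as bytes via the Body parameter.
    s3_client.put_object(Bucket=bucket, Key=key, Body=body)
    return len(body)
```

In real use the client would come from boto3.client("s3"), and the bucket and key would be whatever the ingestion bucket uses. Passing the client in as an argument also makes the function easy to test with moto or a small fake object.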