
Student Projects – Sperrins Syndicate Final Project

Made by Sperrins Syndicate

Bound by the peaks, driven by purpose.

Terrific Totes is a fictional company that operates an OLTP database and a data warehouse used for reporting and visualisations. The goal of this project was to develop applications that Extract, Transform, and Load (ETL) data from an OLTP database into a data lake and warehouse hosted in AWS. Our solution is reliable, scalable, and fully managed using Infrastructure-as-Code.

The features of our project include:

**Automated Data Processing**
– EventBridge Scheduler: triggers data ingestion every 5 minutes.
– Step Function (state machine) with JSON payloads: orchestrates the workflow.
– Three Lambda functions (& layers):
  – One Python application ingests all tables from the OLTP database. It implements incremental refresh so that only new and/or updated data is processed, optimising resource use (a minimal sketch of this approach follows the list below).
  – A second Python application remodels the data into a predefined schema and stores it in the “processed” S3 bucket in Parquet format.
  – A third Python application loads the transformed data into the data warehouse, with new and updated data reflected in the target database within 30 minutes. Fact tables retain a full history of how the data has evolved.
– CloudWatch monitoring & SNS alerts: logs errors, tracks performance, and sends critical failure notifications via SNS.

**Secure Data Management**
– IAM policies: implement the principle of least privilege.
– Secrets Manager: manages database credentials securely.

**Code Quality & Security**
– Python code is PEP8 compliant, thoroughly tested (with unit and integration tests), and checked for security vulnerabilities using pip-audit and bandit.
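To illustrate the incremental-refresh idea, here is a minimal Python sketch of an ingestion step. It is an assumption-laden illustration rather than the project’s actual code: the bucket name, key layout, and the `last_updated` column are placeholders, and the real Lambdas read database credentials from Secrets Manager.

```python
from datetime import datetime, timezone

import boto3
import pg8000.native

INGESTION_BUCKET = "ingestion-bucket"  # placeholder bucket name


def get_last_ingested_at(s3, table: str) -> str:
    """Read the timestamp of the previous ingestion run for a table.

    Falls back to the epoch so the very first run pulls every row.
    """
    try:
        obj = s3.get_object(
            Bucket=INGESTION_BUCKET, Key=f"state/{table}_last_ingested.txt"
        )
        return obj["Body"].read().decode()
    except s3.exceptions.NoSuchKey:
        return "1970-01-01 00:00:00"


def ingest_table(conn: pg8000.native.Connection, s3, table: str) -> None:
    """Fetch only rows created or updated since the last run and store them in S3."""
    since = get_last_ingested_at(s3, table)
    # `table` comes from a fixed internal list of OLTP tables, not user input.
    rows = conn.run(
        f"SELECT * FROM {table} WHERE last_updated > :since", since=since
    )
    if not rows:
        return  # nothing new for this table on this run

    now = datetime.now(timezone.utc)
    key = f"{table}/{now:%Y/%m/%d/%H%M%S}.json"
    s3.put_object(Bucket=INGESTION_BUCKET, Key=key, Body=str(rows))

    # Record the new high-water mark for the next scheduled run.
    s3.put_object(
        Bucket=INGESTION_BUCKET,
        Key=f"state/{table}_last_ingested.txt",
        Body=f"{now:%Y-%m-%d %H:%M:%S}",
    )
```

The stored timestamp acts as a high-water mark, so each five-minute run only pulls rows that changed since the previous run.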

The Team

Anas Elsafi

A data engineer who loves turning messy data into something useful, with a background in process automation and a knack for optimising data pipelines. I enjoy working with cloud tech and solving data challenges. When I’m not wrangling data, I’m a lucky dad to two wonderful daughters, and I love exploring new music or enjoying a good cup of coffee.

Angelo Rohanathan

I am a recent Mathematics and Physics graduate with a strong analytical mindset and problem-solving skills. I have completed the Northcoders Data Engineering Bootcamp, developing expertise in data pipelines, cloud technologies, and database management. I am passionate about leveraging technical skills to solve complex data challenges and eager to apply my knowledge in real-world projects. Outside of data engineering, I enjoy bouldering, football and video games.

Chiheng Jiang

I am transitioning from decoding languages to decoding data as a Junior Data Engineer, while also being an amateur drummer, much to the delight (or dismay) of the neighbours 🙂

Leila Carrier-Sippy

After working for 5 years as a Business Analyst in software, I decided to shift into a more data-centric role, as working with data was my favourite part of the job. Northcoders has significantly deepened my knowledge of data engineering and ignited a new passion for coding.

Sezgi Khajotia

With a background in neuroscience, I joined Northcoders with some experience using code to prepare and analyse data and extract new and interesting insights. Through the bootcamp, I have taken that curiosity to the next level and now consider myself a budding data engineer and data scientist! When I’m not trying to solve problems with code, I am either taking care of my two wonderful children or, more recently, out trying to improve my (still novice) tennis skills.

Youness Senhaji

Former apprentice software engineer, with experience in full stack development, scripting, automating pipelines, and making cool names.

Tech Stack


We used: AWS, GitHub, Terraform, Python, PostgreSQL, and Tableau.

By using the technologies introduced and covered throughout the Data Engineering course, we were able to thoroughly consolidate our knowledge of the course materials. The key languages used in this project were Python (75%), Terraform (24%) and Make (~1%), as reflected in our GitHub repository.

AWS services were used to build the infrastructure for our ETL pipeline, including CloudWatch, Lambda, EventBridge, Step Functions, SNS, S3, and Secrets Manager. In addition, by using Terraform and GitHub, we were able to define our entire infrastructure as code and create a project with fully automated deployment.

PostgreSQL provided easy integration with Python, enabling us to extract data from and load data to our databases. Tableau was chosen as the application in which to generate our data visualisations, due to its ready availability on our operating systems. Finally, the following libraries were also used in our Python code, providing essential additional functionality: boto3, moto, pandas, numpy, pg8000, freezegun, coverage, bandit, and black.
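As an example of how some of these libraries fit together, below is a minimal sketch of the transform step writing a dimension table to the processed bucket as Parquet using pandas and boto3. The table, column, bucket, and key names are illustrative placeholders, and a Parquet engine such as pyarrow is assumed to be available (for example via the AWS-managed pandas layer mentioned under Challenges Faced).

```python
from io import BytesIO

import boto3
import pandas as pd

PROCESSED_BUCKET = "processed-bucket"  # placeholder bucket name


def build_dim_currency(currency_rows: list[dict]) -> pd.DataFrame:
    """Remodel raw currency rows into an illustrative dim_currency shape."""
    df = pd.DataFrame(currency_rows)
    # Keep only the columns the warehouse dimension needs.
    return df[["currency_id", "currency_code", "currency_name"]]


def write_parquet_to_s3(df: pd.DataFrame, key: str) -> None:
    """Serialise a dataframe to Parquet in memory and upload it to the processed bucket."""
    buffer = BytesIO()
    df.to_parquet(buffer, index=False)  # requires a Parquet engine such as pyarrow
    boto3.client("s3").put_object(
        Bucket=PROCESSED_BUCKET, Key=key, Body=buffer.getvalue()
    )


if __name__ == "__main__":
    rows = [
        {"currency_id": 1, "currency_code": "GBP", "currency_name": "British pound"},
        {"currency_id": 2, "currency_code": "USD", "currency_name": "US dollar"},
    ]
    write_parquet_to_s3(build_dim_currency(rows), "dim_currency/2024-01-01.parquet")
```

Serialising to an in-memory buffer avoids writing intermediate files to the Lambda’s limited temporary storage.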

Challenges Faced

Here are the top three challenges that we overcame.

The first challenge arose because the transform Lambda sometimes needed to join two tables to create a dim table. Since our pipeline uses incremental refresh, the transform Lambda did not always have access to the latest data for the table it needed to join with. The solution was to add logic to the ingest Lambda that sends a full current snapshot of the join table whenever it detects changes in the relevant tables.

The next challenge concerned the pandas layer needed for the transform and ingest Lambdas. We had compatibility issues because the Lambdas run a different operating system from our machines, and size issues because the pandas package exceeded the size limits of a Lambda layer. The solution was to use an AWS-managed Lambda layer for the pandas dependency.

Finally, we had issues with module imports for Python files that imported multiple functions across a complex folder structure. The solution was to use sys.path.append so that both the cloud and local versions of our code ran without errors.
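For illustration, here is a minimal sketch of the sys.path.append workaround described above. The folder layout and helper names are hypothetical, not the project’s actual structure.

```python
# Hypothetical layout: src/ holds the Lambda handlers and utils/ holds shared helpers.
#
#   project/
#   ├── src/
#   │   └── transform_lambda.py   (this file)
#   └── utils/
#       └── schema_helpers.py

import os
import sys

# Add the project root to the module search path so that `utils` can be imported
# both when the code runs locally (from the repo root) and when it is packaged
# into a Lambda deployment with the same folder structure.
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

from utils.schema_helpers import build_dim_table  # noqa: E402


def lambda_handler(event, context):
    """Minimal handler showing that the shared helper resolves after the path tweak."""
    return build_dim_table(event)
```

Appending the project root means the same import statement resolves whether the module is run from the repository, from the test suite, or from the packaged Lambda.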