Made by Team Pottery
Team Pottery: Crafting Code, Firing Up Success!
Our Terrific Totes project utilised an ETL pipeline orchestrated by a Step Function triggered every 30 minutes. The Extract Lambda handler connects to the Totesys database and checks for new data; any new records are written to our S3 ingestion bucket as CSV files, organised by date. The Transformation Lambda handler then pulls the latest modified file from each table's directory in the ingestion bucket and converts them into DataFrames. These are held in a dictionary structure, allowing our 11 specialised transform utilities to build the dimension and fact tables that form our final product, which are then uploaded to the processed bucket in Parquet format. Finally, the Load Lambda handler retrieves the most recently modified Parquet files from the processed bucket, establishes a database connection, and efficiently writes or updates each dimension and fact table, completing the data flow.

To check the quality and security of our code, Flake8, Black, pip-audit, and pytest-cov were integrated into our CI/CD pipeline, giving a total test coverage of 88% across the project.

To integrate the various components of the project, we used Terraform, AWS, and Python. Terraform automated the deployment of all resources, and each Python script was packaged and deployed as an individual AWS Lambda function, with the necessary dependencies included as Lambda layers. This setup connects to two Postgres databases, one for ingestion and one for loading, and provides the Lambda functions with the credentials they need to access them. For monitoring, we used CloudWatch metric filters to detect “ERROR” logs from the Lambda functions, triggering alarms that notify multiple email subscribers. We also defined IAM roles for all Lambda functions and Step Functions, granting only the specific permissions they need to access S3 buckets and CloudWatch Logs and to invoke other functions. Finally, we used two S3 buckets: one for ingesting raw data, and another for storing the transformed data.
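As a rough illustration of how a transform stage like ours can pull the latest file from each table's directory into a dictionary of DataFrames, here is a minimal sketch assuming boto3 and pandas; the bucket name, table list, and helper names are hypothetical rather than the project's actual identifiers.

```python
import io

import boto3
import pandas as pd

# Hypothetical names, for illustration only.
INGESTION_BUCKET = "terrific-totes-ingestion"
TABLES = ["sales_order", "staff", "address", "currency"]

s3 = boto3.client("s3")


def latest_key(bucket: str, prefix: str):
    """Return the key of the most recently modified object under a prefix."""
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    objects = response.get("Contents", [])
    if not objects:
        return None
    return max(objects, key=lambda obj: obj["LastModified"])["Key"]


def load_latest_dataframes() -> dict[str, pd.DataFrame]:
    """Build a {table_name: DataFrame} dictionary from the newest CSV per table."""
    dataframes = {}
    for table in TABLES:
        key = latest_key(INGESTION_BUCKET, f"{table}/")
        if key is None:
            continue  # no new data ingested for this table yet
        body = s3.get_object(Bucket=INGESTION_BUCKET, Key=key)["Body"].read()
        dataframes[table] = pd.read_csv(io.BytesIO(body))
    return dataframes
```

The resulting dictionary is what the individual transform utilities would read from when assembling the dimension and fact tables.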
The Team

Sam Brear
My entire career has involved working with data, encompassing tasks such as organising staff training records, developing a database of all marine organisms on a reef and, as a teacher, analysing student achievement trends to identify reasons for cohort underperformance. I decided to formalise and consolidate the data skills acquired in my previous roles by enrolling in the Northcoders Data Engineering bootcamp. I am excited to move into a purely technical role in the Data domain that will allow me to combine all my experience and continue to develop my technical and analytical skills.

Vanessa Gouveia
I’m a driven Mathematics graduate currently developing strong data engineering skills through the Northcoders bootcamp. I’ve gained hands-on experience with Python, SQL, and AWS, with a focus on data handling and object-oriented programming (OOP). I’m eager to apply my analytical mindset in a dynamic, fast-paced environment and continue growing as a data professional by building on both my technical and problem-solving abilities.

James Lippmann
Data Engineer with a background in leading creative projects and a passion for turning data into meaningful insight. My journey into tech began through analysing streaming and audience data in the music industry, which led me to deepen my technical skill set through a data engineering bootcamp at Northcoders. There, I built hands-on experience with Python, SQL, AWS, and cloud technologies. Skilled in Jira and Agile ways of working, I approach complex problems with structure and collaboration. I’m excited to apply both my creative and technical strengths to a data-driven role in tech. I also hold a PRINCE2 Foundation certification and an ACCA Diploma in Accounting & Business, supporting a structured, commercially aware approach to projects.

Hamoud Alzafiry
I’m a Software Developer with 4.5 years of experience delivering high-quality code on projects spanning multiple components of a core banking system. My work has consistently involved data-driven development, which sparked a deep interest in data engineering. To align my career with this passion, I recently completed a specialised Data Engineering bootcamp with Northcoders. I’m now seeking to transition into a Data Engineer role where I can apply both my solid software foundation and my growing expertise in data technologies to build scalable, efficient data solutions.

Zainab Sharif
I have switched from the health industry to the tech industry, having recently completed a data engineering bootcamp at Northcoders, where I worked with a range of programming languages and technologies including JavaScript, Python, APIs, AWS, and SQL. I am now ready to start my new tech career.
Tech Stack

Our Terrific Totes project leveraged a robust and scalable technology stack to build an efficient Extract, Transform, Load (ETL) pipeline.

Core AWS Services
AWS Lambda: We used Lambda to host three distinct Python functions, each dedicated to a specific stage of our ETL process: data extraction, data transformation, and data loading. This serverless approach provided a scalable and cost-effective execution environment for our pipeline.
Amazon S3: S3 served as our project’s data lake. An ingestion bucket temporarily stored raw data in CSV format after extraction, while a processed bucket held the optimised, transformed data in Parquet format, ready for loading into the data warehouse.

Databases & Data Formats
PostgreSQL: This relational database played a dual role, acting as both our source transactional database for extraction and our target analytical data warehouse for the refined data.
CSV (Comma-Separated Values): We chose CSV as the initial format for data extracted from our PostgreSQL source due to its simplicity and ease of generation from database queries.
Parquet: For our transformed data we opted for Parquet, an optimised columnar storage format. Its efficient compression and excellent query performance make it ideal for analytical workloads, and we used it for the intermediate processed files stored in S3.

Programming & Libraries
Python: Our primary programming language, used for all ETL logic within the AWS Lambda functions and to orchestrate every step of the pipeline.
Pandas: During the transformation phase, Pandas was essential for manipulating, cleaning, and reshaping the extracted CSV data into optimised dimension and fact tables.
PyArrow (pyarrow.parquet): We leveraged PyArrow to read and write data in Parquet format efficiently, ensuring optimal performance for data storage and retrieval in S3.
SQLAlchemy: Employed in the load Lambda function, SQLAlchemy allowed us to establish robust connections to the PostgreSQL data warehouse and seamlessly insert the transformed Pandas DataFrames into the target tables.
csv module: The extract Lambda function used Python’s built-in csv module to programmatically write data fetched from the source database into CSV strings.
io.StringIO and io.BytesIO: These classes were critical for handling data in memory. StringIO buffered CSV content, while BytesIO handled binary Parquet data, eliminating the need for temporary disk writes and significantly improving efficiency.
datetime and timedelta: Used in the extract Lambda, these enabled us to calculate time windows for incremental data loading, ensuring that only new or updated records were processed in subsequent runs.

The tech stack chosen for the Terrific Totes project provided a robust, scalable, and cost-effective foundation for our ETL pipeline.
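To show how a few of these pieces can work together in memory, here is a minimal sketch, assuming boto3 and pandas with PyArrow installed: a DataFrame is serialised to Parquet in an io.BytesIO buffer and uploaded to S3, and a helper derives a 30-minute incremental window with datetime and timedelta. The bucket and function names are illustrative, not the project's actual identifiers.

```python
from datetime import datetime, timedelta, timezone
from io import BytesIO

import boto3
import pandas as pd

s3 = boto3.client("s3")

# Hypothetical bucket name, for illustration only.
PROCESSED_BUCKET = "terrific-totes-processed"


def upload_parquet(df: pd.DataFrame, key: str) -> None:
    """Serialise a DataFrame to Parquet in memory and upload it to S3."""
    buffer = BytesIO()
    df.to_parquet(buffer, index=False)  # pandas uses PyArrow to write Parquet
    buffer.seek(0)
    s3.put_object(Bucket=PROCESSED_BUCKET, Key=key, Body=buffer.getvalue())


def incremental_window(minutes: int = 30) -> tuple[datetime, datetime]:
    """Return the (start, end) of the time window for an incremental run."""
    end = datetime.now(timezone.utc)
    start = end - timedelta(minutes=minutes)
    return start, end
```

Keeping the Parquet bytes in a BytesIO buffer avoids writing temporary files to the Lambda filesystem, which is the efficiency gain described above.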
Challenges Faced
Throughout the project, we faced several key challenges that we tackled to ensure a robust and secure ETL pipeline.

Handling Sensitive Information
Early on, one of our primary concerns was securely managing sensitive information, and we initially had difficulty sharing environment variables between us. We settled on an approach that combines GitHub environment variables with a Terraform secret.tfvars file, streamlining our CI/CD pipeline and allowing continuous development of the project. This adjustment ensured our sensitive data was handled securely and correctly.

Lambda Layer Management
During development, deploying layers for our AWS Lambda functions presented a significant hurdle. Initially, we consolidated all dependencies and utilities into a single shared layer for our three Lambda functions, but this approach led to issues, and we found that splitting the layer into smaller segments was a more stable solution. A further challenge arose from the substantial size of the pandas and PyArrow dependencies; we resolved this by using AWS’s built-in Pandas layer, which conveniently includes PyArrow and kept our deployment package size down.

Database Loading and Data Type Coercion
Our final major challenge occurred during the load phase. Loads frequently failed due to incorrect data types, particularly dates that were formatted as strings. To resolve this, we implemented explicit type and format specifications during the transformation stage of our ETL process, ensuring data consistency and successful ingestion into the database.
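As a minimal sketch of the kind of explicit type and format specification described above, assuming pandas, the snippet below coerces date, time, and numeric columns before the Parquet files are written; the column names are illustrative, not the project's actual schema.

```python
import pandas as pd


def coerce_fact_sales_types(df: pd.DataFrame) -> pd.DataFrame:
    """Apply explicit types and formats so the warehouse load does not fail on strings.

    Column names here are illustrative only.
    """
    df = df.copy()
    # Parse date strings with an explicit format instead of relying on inference.
    df["created_date"] = pd.to_datetime(df["created_date"], format="%Y-%m-%d").dt.date
    # Parse time strings with an explicit format as well.
    df["created_time"] = pd.to_datetime(df["created_time"], format="%H:%M:%S.%f").dt.time
    # Coerce numeric columns explicitly so bad values surface during transformation,
    # not during the database load.
    df["units_sold"] = df["units_sold"].astype("int64")
    df["unit_price"] = pd.to_numeric(df["unit_price"]).round(2)
    return df
```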