Bag End: The Fellowship of the Sales
One Data Pipeline to Rule Them All.
Bag End: The Fellowship of the Sales is a data pipeline designed to extract, transform, and load sales data efficiently. It retrieves raw data from PostgreSQL, processes it into a structured warehouse format using Pandas and PyArrow, and stores it in S3 for downstream analytics. Built with AWS Lambda, Step Functions, and SQLAlchemy, it provides scalable, automated data processing, handles schema changes and empty extracts, and applies indexing at the load stage.
The Team
Nicole Rutherford
I was inspired to work in the tech field through my interest in analytical thinking and coding. During my studies as a Mathematics student, I appreciated the breadth of STEM, from researching abstract ideas to writing my dissertation on mathematical structures within musical compositions. I have a proven ability to adapt to new challenges, as demonstrated by my experience working abroad, navigating cultural differences, and managing multiple self-driven projects. After moving to China, I started building my own blog to document my experiences, sharing insights about culture shock, the challenges of adapting to life abroad, and how I navigated the lockdown. This journey, combined with my continuous self-study and growing interest in coding, motivated me to pursue a career in tech.
Bonnie Packer
I am a Junior Engineer currently completing the Northcoders data engineering boot camp, following a successful career in mental health and criminal justice. With a master’s degree in forensic psychology, I gained a strong foundation in statistical analysis, using R to solve complex problems and uncover insights. My experience managing risk in high-pressure environments has given me a deep understanding of the criminal justice system and a practical approach to problem-solving. I’m excited to leverage my technical skills and psychological expertise to create data-driven solutions that drive meaningful change in sectors such as crime analysis, rehabilitation programmes, and cybersecurity.
Beth Dunstan
I have a background in environmental science with a particular interest in tackling global challenges such as climate risk, biodiversity loss, and sustainability. As environmental data grows in scale and complexity, we need better ways to manage and extract meaningful insights from it. I am retraining in data engineering to develop the skills needed to design scalable data pipelines, integrate cloud computing, and leverage AI to transform raw data into actionable insights. By improving how we process and analyze environmental data, we can drive better real-world decision-making and create more effective, sustainable solutions at scale.
Pieter van den Broek
I’m a maths and philosophy graduate, currently completing a full-time course in data engineering. I’m looking to work in data analysis.
Luke Guantlett
As an aspiring Data Engineer, I have tried a variety of jobs but am always drawn back to coding and automation by my curiosity and desire for efficiency. I take pride in my ability to lead teams, analyse data, and optimise business processes, and in being reliable, logical, efficient, and hard-working. My experience includes web scraping products, building websites, automating the progression of my personal training clients, and project management.
Tech Stack

Cloud & Infrastructure
AWS Lambda – Serverless computing to extract, transform, and load (ETL) data efficiently without managing infrastructure.
AWS S3 – Cloud storage for raw and processed sales data, allowing scalable and cost-effective data management.
AWS Step Functions – Orchestrates the ETL pipeline, ensuring a smooth workflow between Lambda functions.
AWS CloudWatch – Logs and monitors the pipeline, alerting us to errors and performance issues.
AWS SNS – Sends notifications for critical failures in the pipeline.

Database & Querying
PostgreSQL – Reliable, SQL-based database storing structured data from sales transactions.
SQLAlchemy (pg8000 driver) – ORM and database toolkit for interacting with PostgreSQL in a Pythonic way.

Data Processing & Format
Pandas & PyArrow – Used in the transformation Lambda to efficiently handle and convert data to Parquet.
Parquet format – Optimized for analytics, reducing storage costs and improving query speed.

Testing & Development
Moto (mock_aws) – Mocks AWS services like S3 and SNS for unit testing without incurring AWS costs.
Pytest – Structured and modular testing for each part of the pipeline.
unittest.mock – Replaces database connections and AWS services with controlled mock objects.

Why SQLAlchemy?
We needed a reliable and flexible way to interact with our PostgreSQL database from AWS Lambda. SQLAlchemy was chosen because:
Abstraction & maintainability – It allows us to write database queries in a Pythonic way while keeping our code modular.
Safe query handling – Using bound parameters (e.g., WHERE created_at > :last_extract_time) prevents SQL injection; a minimal sketch appears at the end of this section.
Flexibility – We can write both raw SQL queries and ORM-based models if needed.
Works with pg8000 – AWS Lambda has issues with some PostgreSQL drivers (such as psycopg2, due to compiled dependencies), whereas pg8000 is pure Python and lightweight, making it a better fit.
Alternative considered: raw SQL with pg8000 alone – but SQLAlchemy improves readability, safety, and maintainability.

Why PyArrow?
When transforming and storing data, we needed a format that is optimized for analytics and handles large datasets efficiently. PyArrow was chosen because:
Parquet support – It natively supports Parquet, which is an optimal format for our data warehouse.
High performance – Its columnar memory layout makes it faster than Pandas for processing large tables.
Efficient storage – Parquet is compressed and columnar, making it faster to read in analytics workloads.
AWS-compatible – Amazon Athena, Redshift, and other services prefer Parquet over JSON/CSV for querying.
Alternative considered: CSV or JSON – but they are slower, take up more space, and are inefficient for analytical queries.

Why Pandas?
When transforming raw sales data, we needed a tool to clean, manipulate, and structure the data before writing it to S3. Pandas was chosen because:
DataFrame operations – Easy filtering, grouping, and type conversion.
Integrates with PyArrow – Allows seamless conversion to Parquet.
Handles missing data – Pandas is great for dealing with nulls and inconsistencies in the source data.
Well supported – A widely used industry standard for Python data processing.
Alternative considered: Dask (for parallel processing) – but our dataset is small enough that Pandas + PyArrow works efficiently.
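To make the bound-parameter point concrete, here is a minimal sketch of an extract query. It is illustrative only: the connection string, table names, and the source of last_extract_time are hypothetical placeholders rather than our exact implementation.

```python
# Minimal sketch of extracting only new rows with SQLAlchemy + pg8000.
# Connection string and table names are hypothetical placeholders.
import pandas as pd
from sqlalchemy import create_engine, text

# pg8000 is pure Python, so it packages cleanly into a Lambda deployment.
engine = create_engine("postgresql+pg8000://user:password@host:5432/sales_db")

def extract_new_rows(table_name: str, last_extract_time: str) -> pd.DataFrame:
    """Fetch rows created since the previous extract run."""
    # Table names cannot be bound parameters, so table_name should come from a
    # fixed internal list; the timestamp is bound to prevent SQL injection.
    query = text(f"SELECT * FROM {table_name} WHERE created_at > :last_extract_time")
    with engine.connect() as conn:
        result = conn.execute(query, {"last_extract_time": last_extract_time})
        return pd.DataFrame([dict(row) for row in result.mappings()])
```

Binding the timestamp rather than interpolating it into the SQL string is what keeps the incremental extract safe against injection.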
Why This Combination?
SQLAlchemy + pg8000 → safely fetch data from PostgreSQL
Pandas → clean, transform, and prepare structured data
PyArrow → convert data efficiently and save as Parquet
This stack ensures our pipeline is fast, scalable, and cost-effective while avoiding unnecessary complexity.
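As a rough end-to-end illustration of that combination, the sketch below cleans a DataFrame with Pandas, converts it to Parquet with PyArrow, and uploads it to S3. The column names, bucket, and key layout are assumptions made for illustration, not the project's actual schema.

```python
# Sketch of the transform step: Pandas for cleaning, PyArrow for Parquet,
# boto3 for the S3 upload. Bucket, key, and column names are hypothetical.
import io

import boto3
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def transform_and_store(df: pd.DataFrame, bucket: str, key: str) -> None:
    # Basic cleaning: drop fully empty rows and normalise a timestamp column.
    df = df.dropna(how="all")
    if "created_at" in df.columns:
        df["created_at"] = pd.to_datetime(df["created_at"])

    # Convert to an Arrow table and write Parquet into an in-memory buffer,
    # so nothing needs to touch the Lambda's local filesystem.
    table = pa.Table.from_pandas(df, preserve_index=False)
    buffer = io.BytesIO()
    pq.write_table(table, buffer, compression="snappy")

    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=buffer.getvalue())
```

Writing Parquet to an in-memory buffer keeps the Lambda stateless and avoids relying on local temporary storage for typical batch sizes.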
Challenges Faced
Challenges We Overcame:
Pandas size limitation – Our Lambda function exceeded the AWS package size limit when including Pandas, so we switched to AWS managed layers to keep deployments lightweight.
Indexing in the Transform step – Initially, we tried indexing columns in our Transform Lambda, but this made data retrieval from S3 too complex and increased processing time in the Load step. We simplified this by moving indexing to the Load function instead.
Handling empty files – Our Extract function initially assumed that every table would always have new data, but in reality updates happen randomly per table. We had to adapt our code to handle cases where some or all tables had no new data, ensuring empty files were still created as expected (see the sketch below).
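The empty-file behaviour is also easy to pin down with the testing stack described above (Pytest plus Moto). Below is a hedged sketch: the helper, bucket name, and key layout are hypothetical, but it shows the pattern of asserting that an object is written even when a table has no new rows.

```python
# Sketch of a moto-backed test for "an empty extract still writes a file".
# Function, bucket, and key names are hypothetical placeholders.
import boto3
import pandas as pd
from moto import mock_aws

def write_extract_file(df: pd.DataFrame, table_name: str, run_time: str, bucket: str) -> None:
    """Write one object per table per run, even when there are no new rows."""
    key = f"raw/{table_name}/{run_time}.csv"
    # A header-only CSV is uploaded when df has no rows, so the Transform step
    # always finds a consistent set of keys.
    boto3.client("s3").put_object(
        Bucket=bucket, Key=key, Body=df.to_csv(index=False).encode("utf-8")
    )

@mock_aws
def test_empty_extract_still_creates_file():
    s3 = boto3.client("s3", region_name="eu-west-2")
    s3.create_bucket(
        Bucket="test-raw-bucket",
        CreateBucketConfiguration={"LocationConstraint": "eu-west-2"},
    )

    empty_df = pd.DataFrame(columns=["sale_id", "created_at"])
    write_extract_file(empty_df, "sales", "2024-01-01T00-00-00", "test-raw-bucket")

    listing = s3.list_objects_v2(Bucket="test-raw-bucket")
    assert listing["KeyCount"] == 1  # the empty file was still created
```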