Made by Team Splash World
Driven by data. Powered by teamwork.
We were approached by our client, Terrific Totes, a tote bag retailer, to build a data pipeline that extracts, transforms, and loads (ETL) sales data from their OLTP database (“Totesys”) into an OLAP data warehouse. The aim was to make their sales data more accessible and easier to analyse.

In the extract stage, our pipeline connects to the Totesys PostgreSQL database and ingests updated transaction records from its tables. The raw data is stored in JSON format in an S3 ingestion bucket.

During the transformation stage, the latest JSON file for each table is retrieved from the ingestion bucket and loaded into a Pandas DataFrame. The data is then cleaned and transformed into a star schema, consisting of dimension and fact tables optimised for analytical queries. The resulting DataFrames are saved as Parquet files and uploaded to a second S3 bucket designated for processed data. Parquet was chosen for its efficient, columnar storage format and its ability to preserve data types, preventing downstream TypeErrors.

In the load stage, the Parquet files are extracted from the processed bucket and inserted into a PostgreSQL data warehouse using SQL queries. This completes the ETL pipeline and ensures the data is ready for analysis.

The entire pipeline runs automatically every 20 minutes, orchestrated by an AWS Step Function triggered on a schedule. At each stage, activity is logged to AWS CloudWatch, and email alerts are sent to the team in the event of failure.

The project was developed using test-driven development (TDD), with 94% test coverage. It also follows best practices for security, including protection against SQL injection and secure credential management through GitHub Secrets. Infrastructure is deployed and managed using Terraform.

To support data analysis, we created an interactive Power BI dashboard that visualises insights from the data warehouse. Key findings include:

(1) Terrific Totes’ sales have remained relatively stable by quarter since 2022, with little evidence of seasonality.
(2) The most popular designs by sales revenue are Bronze, Granite, and Soft.
(3) The Republic of Korea is the company’s largest market by sales revenue, with Granite as the top-selling design, though demand has declined since its launch.
(4) The highest-performing salesperson in early 2025 is Imani.

These insights support Terrific Totes in making real-time, data-driven decisions on global expansion, employee bonuses, and broader strategic planning. We learnt a great deal during this project and found that strong teamwork, grounded in agile practices, is key to delivering a good product.
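To make the transformation stage concrete, below is a minimal, illustrative sketch of how one raw JSON file might become star-schema Parquet files. The bucket names, keys, and columns are hypothetical placeholders, not the project’s actual identifiers:

```python
import io
import json

import boto3
import pandas as pd

# Placeholder bucket names for illustration only.
INGESTION_BUCKET = "totesys-ingestion"
PROCESSED_BUCKET = "totesys-processed"

s3 = boto3.client("s3")


def transform_sales_order(key: str) -> None:
    """Reshape the latest raw sales_order JSON into star-schema tables
    and write them to the processed bucket as Parquet."""
    obj = s3.get_object(Bucket=INGESTION_BUCKET, Key=key)
    df = pd.DataFrame(json.loads(obj["Body"].read()))

    # Fact table: measures plus foreign keys into the dimensions.
    fact_sales = df[
        ["sales_order_id", "staff_id", "design_id", "units_sold", "unit_price"]
    ].copy()

    # Example dimension: one row per design (columns are hypothetical).
    dim_design = df[["design_id", "design_name"]].drop_duplicates()

    # Parquet preserves dtypes, unlike CSV, so the load stage receives
    # the same types that left the transform stage.
    for name, table in {"fact_sales": fact_sales, "dim_design": dim_design}.items():
        buffer = io.BytesIO()
        table.to_parquet(buffer, index=False)
        s3.put_object(
            Bucket=PROCESSED_BUCKET, Key=f"{name}.parquet", Body=buffer.getvalue()
        )
```

In the real pipeline each table has its own dimension logic, but the shape of the work is the same: select, deduplicate, and write.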
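The load stage’s protection against SQL injection comes down to parameterised queries: row values are bound by the database driver rather than interpolated into the SQL string. Here is a sketch using psycopg2 purely for illustration (the write-up does not name the project’s actual driver), with placeholder connection details:

```python
import pandas as pd
import psycopg2
from psycopg2.extras import execute_values

# Placeholder connection details; in practice, credentials come from
# a secrets store, never from source code.
conn = psycopg2.connect(
    host="warehouse.example.com", dbname="totesys_dw", user="etl", password="..."
)


def load_table(parquet_path: str, table: str, columns: list[str]) -> None:
    """Insert rows from a processed Parquet file into a warehouse table.

    `table` and `columns` come from a fixed mapping inside the pipeline,
    never from user input, so only the row values need parameterising.
    """
    df = pd.read_parquet(parquet_path)
    rows = list(df[columns].itertuples(index=False, name=None))
    query = f"INSERT INTO {table} ({', '.join(columns)}) VALUES %s"
    with conn, conn.cursor() as cur:
        # execute_values binds every value as a parameter, in batches.
        execute_values(cur, query, rows)
```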
The Team
Tech Stack

We used Python 3.13, Pytest, Pandas, GitHub, AWS, Terraform, PostgreSQL, and Power BI. We chose this tech stack for the following reasons: (1) team capabilities and skillset; (2) performance optimised for our data volume; (3) automated deployment (with Terraform, using IaC); (4) flexibility to make changes; (5) licensing costs (we used mostly open-source software, except for AWS); (6) storage and retention of historical records (using S3 buckets in AWS); (7) ease of data visualisation (Power BI connects seamlessly with our PostgreSQL data warehouse).
Challenges Faced
We found that extracting the full database on each update consumed excessive memory and S3 storage, so we optimised the process by pulling only new sales transactions, using timestamps to detect changes (see the sketch below). We noticed that sales records were sometimes amended in the source database, which initially broke our load function. We fixed this by storing the latest data in the warehouse while keeping a history of transactions in S3. During transformation, we cleaned the data, for example correcting a misspelling of ‘Leeds’ in the staff records to improve accuracy. We considered alternatives such as PySpark for transforming data, but settled on Pandas due to the team’s familiarity with the tool and the ease of converting from JSON to DataFrames to Parquet through the pipeline. We chose the Parquet file format to store processed data because, unlike CSV, Parquet retains data types, preventing type errors from halting our load Lambda downstream. When visualising sales revenue in Power BI, we adjusted for currency to ensure accurate insights.
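As a sketch of the change-detection approach described above, the extract step can filter on a last-updated timestamp instead of re-reading whole tables. The table and column names here are assumptions based on the description, and psycopg2 is again used purely for illustration:

```python
from datetime import datetime, timezone

import psycopg2
from psycopg2.extras import RealDictCursor

# Placeholder connection details for illustration only.
conn = psycopg2.connect(
    host="totesys.example.com", dbname="totesys", user="reader", password="..."
)


def extract_new_sales(last_run: datetime) -> list[dict]:
    """Fetch only sales_order rows created or amended since the last run,
    assuming a last_updated timestamp column on the source table."""
    with conn.cursor(cursor_factory=RealDictCursor) as cur:
        cur.execute(
            "SELECT * FROM sales_order WHERE last_updated > %s",
            (last_run,),
        )
        return cur.fetchall()


# Example: pull everything changed since the previous 20-minute run.
rows = extract_new_sales(datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc))
```

Persisting the checkpoint timestamp between runs (for example, in S3 or a parameter store) is what lets each scheduled invocation pick up exactly where the last one left off.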