Student Projects – Dimensional Transformers Data Project

Made by Dimensional Transformers

From Data to Discovery: Shaping the Future with Engineering

Our project automates the ETL workflow for Terrific Totes, leveraging Python, AWS, database modeling, and CI/CD practices. It extracts data from a PostgreSQL database on AWS RDS, transforms it with Python Lambda functions, and loads it into a PostgreSQL data warehouse. With secure credential management, CloudWatch monitoring, and Terraform and GitHub Actions automation, this scalable solution ensures quality through rigorous testing and PEP8 compliance, delivering actionable insights via a Tableau dashboard.
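The extract-transform-load flow described above can be sketched as a minimal Lambda-style handler. Everything here is illustrative, not the project's actual code: the function bodies stand in for the real stages (querying the RDS PostgreSQL source, reshaping rows toward the star schema, and inserting into the warehouse), and the sample row and field names are invented for the sketch.

```python
# Hypothetical stand-ins for the three pipeline stages; in the real
# project these would query RDS, write Parquet, and load the warehouse.
def extract(since):
    # Pretend these rows, updated after `since`, came from the source DB.
    return [{"sale_id": 1, "amount": "12.50",
             "updated_at": "2024-01-02T00:00:00+00:00"}]

def transform(rows):
    # Cast types and reshape rows toward the warehouse schema.
    return [{**row, "amount": float(row["amount"])} for row in rows]

def load(rows):
    # In production this would INSERT into the PostgreSQL warehouse.
    return len(rows)

def handler(event, context=None):
    # Lambda entry point: run the stages in order and report row count.
    since = event.get("last_updated", "1970-01-01T00:00:00+00:00")
    rows = transform(extract(since))
    return {"loaded": load(rows)}
```

In the deployed pipeline each stage runs as its own Lambda function rather than one handler, with CloudWatch triggering the extract step on a schedule.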

The Team

David Sheffield

No bio provided

Dorota Sawicki

No bio provided

Hamza Nazar

No bio provided

Laura Pugsley

No bio provided

Rohail Zaheer

No bio provided

William Robb

No bio provided

Tech Stack

We used Python, AWS, Postgres, Terraform, GitHub Actions, Parquet, Tableau, and Pytest.

Terraform: Manages and deploys the cloud infrastructure.
Python: Programmatically interacts with and manipulates the data.
SQL: Queries the relational source database and the warehouse.
Parquet: Provides efficient columnar data storage.
GitHub Actions: Automates workflows and minimises manual intervention throughout the project.
AWS: Hosts all the cloud infrastructure.
Pytest: Supports test-driven development.
Tableau: Delivers efficient data visualisation.

Challenges Faced

Statelessness of Lambda Functions: Lambda functions are stateless, so we needed a way to track the last-updated date for data extraction. We solved this by storing the date in AWS Secrets Manager, allowing us to persist the value between invocations. On the first run, we reset the date so that all source data is processed.

Fact Table Load Order: The fact table was loading before the dimension tables, causing reference errors. To fix this, we implemented a sleep mechanism after loading the dimension tables, ensuring they were populated first and maintaining data integrity.

Lambda Layer Dependency Issues: We faced Lambda layer compatibility challenges with PyArrow and SQLAlchemy due to package size. For PyArrow, we used an AWS managed layer. For SQLAlchemy, we created a custom GitHub repository to package and deploy the specific version as a Lambda layer, resolving the compatibility issues.

GitHub: https://github.com/hamza8599/nc-final-project
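The Secrets Manager checkpoint described above can be sketched as a pair of helpers. The secret name, the JSON key, and the helper functions are assumptions for illustration, not the project's actual names; in AWS the client would come from boto3.client("secretsmanager"), but an in-memory stub keeps the sketch runnable without credentials.

```python
import json

def get_last_updated(client, secret_id="totes/last_updated"):
    # Read the checkpoint timestamp persisted between Lambda invocations.
    raw = client.get_secret_value(SecretId=secret_id)["SecretString"]
    return json.loads(raw)["last_updated"]

def set_last_updated(client, timestamp, secret_id="totes/last_updated"):
    # Persist the new checkpoint after a successful extraction.
    client.put_secret_value(
        SecretId=secret_id,
        SecretString=json.dumps({"last_updated": timestamp}),
    )

# In-memory stand-in for the Secrets Manager client, mimicking the two
# boto3 calls used above so the example runs locally.
class FakeSecretsManager:
    def __init__(self):
        self._store = {}

    def get_secret_value(self, SecretId):
        return {"SecretString": self._store[SecretId]}

    def put_secret_value(self, SecretId, SecretString):
        self._store[SecretId] = SecretString

client = FakeSecretsManager()
set_last_updated(client, "2024-01-02T00:00:00Z")
print(get_last_updated(client))  # 2024-01-02T00:00:00Z
```

Because the checkpoint lives outside the Lambda function, each invocation can extract only rows updated since the stored timestamp and then advance it, which is how the pipeline persists state across stateless runs.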