This project is a data pipeline that scrapes football data from different websites and ingests it into the cloud with Google Cloud Storage and locally into PostgreSQL. A series of web scraping functions built with BeautifulSoup extract data from a static HTML webpage; the data is converted into a pandas DataFrame and then into Parquet format before being loaded into Google Cloud Storage, Google's data lake storage service. This is a simple implementation of a data pipeline with no orchestration tool, as sketched below.
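A minimal sketch of that flow, with a hypothetical source URL, bucket name, and Postgres connection string (the real project scrapes several football sites, but the scrape → DataFrame → Parquet → GCS/Postgres steps are the same):

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup
from google.cloud import storage
from sqlalchemy import create_engine

# Hypothetical source page, bucket, and connection string; substitute
# the real football sites and credentials.
URL = "https://example.com/league-table"
BUCKET = "football-data-lake"
PG_URI = "postgresql://user:password@localhost:5432/football"

def scrape_table(url: str) -> pd.DataFrame:
    """Scrape the first HTML table on a static page into a DataFrame."""
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    rows = [[cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
            for tr in soup.select("table tr")]
    return pd.DataFrame(rows[1:], columns=rows[0])

def load(df: pd.DataFrame, name: str) -> None:
    """Write the DataFrame to Parquet, upload to GCS, and mirror to Postgres."""
    path = f"/tmp/{name}.parquet"
    df.to_parquet(path, index=False)  # needs pyarrow or fastparquet
    storage.Client().bucket(BUCKET).blob(f"{name}.parquet").upload_from_filename(path)
    df.to_sql(name, create_engine(PG_URI), if_exists="replace", index=False)

if __name__ == "__main__":
    load(scrape_table(URL), "league_table")
```

Parquet rather than CSV keeps column types intact, which is the natural format for a GCS-backed data lake.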
This project implements a dockerized end-to-end data pipeline that consumes data in JSON format from an API service. It demonstrates key concepts such as data orchestration and data processing with Apache Spark. To facilitate deployment, scalability, and reproducibility, Docker is used to containerize the pipeline.
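A minimal sketch of the Spark step, assuming a hypothetical API endpoint and output path; in the real project a script like this runs inside the Docker container and is triggered by the orchestrator:

```python
import json

import requests
from pyspark.sql import SparkSession

# Hypothetical endpoint; the real pipeline's API service is not named here.
API_URL = "https://api.example.com/records"

spark = SparkSession.builder.appName("json-api-pipeline").getOrCreate()

# Fetch the JSON payload (assumed to be a list of records) and
# distribute it to Spark as one JSON document per element.
records = requests.get(API_URL, timeout=30).json()
rdd = spark.sparkContext.parallelize([json.dumps(r) for r in records])
df = spark.read.json(rdd)

# Persist the processed data; the Parquet sink path is also an assumption.
df.write.mode("overwrite").parquet("/data/output")
```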
This project implements Change Data Capture (CDC) in real time with Postgres, Debezium, Kafka, and Apache Spark: Debezium reads the Postgres write-ahead log and publishes every insert, update, and delete to Kafka, where Spark consumes the change events. CDC matters because it provides a robust mechanism for data governance and auditing as more and more users access data within an organization.
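A minimal sketch of the Spark consumer, assuming Debezium's default topic naming (`<server>.<schema>.<table>`) and a JSON converter with schemas disabled, so that each Kafka message value is the bare Debezium payload:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

# Needs the Kafka source on the classpath, e.g.
#   spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 ...
spark = SparkSession.builder.appName("cdc-consumer").getOrCreate()

# Assumed topic name; Debezium defaults to <server>.<schema>.<table>.
TOPIC = "pgserver.public.orders"

# Extract only the operation type from the Debezium envelope
# (c = create, u = update, d = delete); keep the raw event for auditing.
envelope = StructType([StructField("op", StringType())])

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", TOPIC)
          .load()
          .select(col("value").cast("string").alias("json"))
          .select(from_json("json", envelope).alias("e"), col("json")))

query = (events.select(col("e.op").alias("op"), col("json"))
         .writeStream.format("console")
         .option("truncate", "false")
         .start())
query.awaitTermination()
```

The console sink is for demonstration; in practice the stream would land in a table or object store to build the audit trail.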