This project is a data pipeline that scrapes football data from different websites and ingests it into the cloud with Google Cloud Storage and locally into PostgreSQL. A series of web scraping functions built with BeautifulSoup extract data from a static HTML webpage; the data is converted into a pandas DataFrame and then into Parquet format before being loaded into Google Cloud Storage, Google's data lake storage service. This is a simple implementation of a data pipeline with no orchestration tool, as sketched below.
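A minimal sketch of that flow, with a hypothetical source URL, bucket name, and Postgres connection string (the real project scrapes several football sites, but the scrape → DataFrame → Parquet → GCS/Postgres steps are the same):

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup
from google.cloud import storage
from sqlalchemy import create_engine

# Hypothetical source page, bucket, and connection string; substitute
# the real football sites and credentials.
URL = "https://example.com/league-table"
BUCKET = "football-data-lake"
PG_URI = "postgresql://user:password@localhost:5432/football"

def scrape_table(url: str) -> pd.DataFrame:
    """Scrape the first HTML table on a static page into a DataFrame."""
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    rows = [[cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
            for tr in soup.select("table tr")]
    return pd.DataFrame(rows[1:], columns=rows[0])

def load(df: pd.DataFrame, name: str) -> None:
    """Write the DataFrame to Parquet, upload to GCS, and mirror to Postgres."""
    path = f"/tmp/{name}.parquet"
    df.to_parquet(path, index=False)  # needs pyarrow or fastparquet
    storage.Client().bucket(BUCKET).blob(f"{name}.parquet").upload_from_filename(path)
    df.to_sql(name, create_engine(PG_URI), if_exists="replace", index=False)

if __name__ == "__main__":
    load(scrape_table(URL), "league_table")
```

Parquet rather than CSV keeps column types intact, which is the natural format for a GCS-backed data lake.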
This project implements a dockerized end-to-end data pipeline that consumes data in JSON format from an API service. It demonstrates key concepts such as data orchestration and data processing with Apache Spark. To facilitate deployment, scalability, and reproducibility, Docker is used to containerize the pipeline.
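A minimal sketch of the Spark step, assuming a hypothetical API endpoint and output path; in the real project a script like this runs inside the Docker container and is triggered by the orchestrator:

```python
import json

import requests
from pyspark.sql import SparkSession

# Hypothetical endpoint; the real pipeline's API service is not named here.
API_URL = "https://api.example.com/records"

spark = SparkSession.builder.appName("json-api-pipeline").getOrCreate()

# Fetch the JSON payload (assumed to be a list of records) and
# distribute it to Spark as one JSON document per element.
records = requests.get(API_URL, timeout=30).json()
rdd = spark.sparkContext.parallelize([json.dumps(r) for r in records])
df = spark.read.json(rdd)

# Persist the processed data; the Parquet sink path is also an assumption.
df.write.mode("overwrite").parquet("/data/output")
```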
This project implements Change Data Capture (CDC) in real time with Postgres, Debezium, Kafka, and Apache Spark: Debezium reads the Postgres write-ahead log and publishes every insert, update, and delete to Kafka, where Spark consumes the change events. CDC matters because it provides a robust mechanism for data governance and auditing as more and more users access data within an organization.
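A minimal sketch of the Spark consumer, assuming Debezium's default topic naming (`<server>.<schema>.<table>`) and a JSON converter with schemas disabled, so that each Kafka message value is the bare Debezium payload:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

# Needs the Kafka source on the classpath, e.g.
#   spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 ...
spark = SparkSession.builder.appName("cdc-consumer").getOrCreate()

# Assumed topic name; Debezium defaults to <server>.<schema>.<table>.
TOPIC = "pgserver.public.orders"

# Extract only the operation type from the Debezium envelope
# (c = create, u = update, d = delete); keep the raw event for auditing.
envelope = StructType([StructField("op", StringType())])

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", TOPIC)
          .load()
          .select(col("value").cast("string").alias("json"))
          .select(from_json("json", envelope).alias("e"), col("json")))

query = (events.select(col("e.op").alias("op"), col("json"))
         .writeStream.format("console")
         .option("truncate", "false")
         .start())
query.awaitTermination()
```

The console sink is for demonstration; in practice the stream would land in a table or object store to build the audit trail.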