An end-to-end data pipeline built using Amazon AWS.

This project implements a data pipeline using Amazon Web Services (AWS) to process and analyze a Flight Dataset. The pipeline collects raw data, processes it, stores the processed data in a data warehouse, and performs analysis using SQL queries. The analysis results are visualized dynamically using Power BI dashboards.
The data pipeline consists of several stages:
- Data Collection: Raw flight data is retrieved from an S3 bucket (
my-flights-raw-data
) using AWS S3 client. - Data Processing: The data undergoes cleaning and transformation:
- Unwanted columns are dropped.
- Airport details (city, state) are extracted from the
airport_name
field. - Carrier and airport tables are created with unique IDs for efficient querying.
- Date information is structured to facilitate time-series analysis.
- Data Storage: Processed data (flights, carriers, airports, dates) is converted to CSV format and uploaded to an S3 bucket (
my-flights-processed-data
).
- Storage: AWS S3 is used as a data warehouse for storing the processed data tables (
flights.csv
,carriers.csv
,airports.csv
,dates.csv
).
- SQL Analysis: The processed data is queried using SQL to derive insights such as:
- Total number of flights and delays.
- Delay reasons (carrier delay, weather delay, etc.).
- Airport-wise and carrier-wise performance metrics.
- Power BI Dashboard: The analysis results are visualized dynamically using Power BI:
- Dashboards are created to visualize trends, patterns, and key metrics derived from the SQL analysis.
- Interactive visualizations allow users to explore flight performance data across different dimensions (time, carrier, airport).