
Data Processing and Storage Pipeline for E-Commerce Behavior Data

This project builds a big data pipeline to source, process, and visualize e-commerce behavior data. Python scripts handle data sourcing, Kafka streams the events, Apache Spark performs the ETL processing, and Tableau powers the visualization.

Apache Kafka
AWS
Apache Spark
Tableau
ETL
Batch Processing
[Figure: big data pipeline architecture diagram]

Project Overview:

  • End-to-End Big Data Pipeline: Developed a data pipeline using Python for data sourcing, Kafka for real-time streaming, Apache Spark for ETL processes, and Tableau for visualization.

  • Real-Time Data Streaming: Implemented Kafka for efficient data ingestion and real-time processing.
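Below is a minimal sketch of the streaming layer, assuming a local broker at localhost:9092, the kafka-python client, and an illustrative topic name and event schema (ecommerce-events, user_id/event_type/price); none of these names are taken from the project itself.

import json
from kafka import KafkaProducer, KafkaConsumer

# Producer: serialize each e-commerce event as JSON and publish it.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)
producer.send("ecommerce-events", {"user_id": 42, "event_type": "view", "price": 19.99})
producer.flush()

# Consumer: read events from the start of the topic and hand them downstream.
consumer = KafkaConsumer(
    "ecommerce-events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
for message in consumer:
    print(message.value)  # in the full pipeline this feeds the Spark stage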

 

Technical Highlights:

  • Data Sourcing and Integration: Integrated large datasets from Kaggle using Python scripts.

  • Apache Kafka: Set up Kafka for real-time data streaming, including managing topics and deploying producer/consumer scripts.

  • Apache Spark ETL: Used Spark for complex ETL operations analyzing user behavior, engagement, and price sensitivity, with outputs in CSV and ORC formats (see the sketch after this list).

  • Tableau Visualization: Created interactive Tableau dashboards for insights into pricing, user interactions, and category-level analysis.
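A condensed sketch of the Spark ETL step, assuming events have been landed as CSV files with columns such as user_id, event_type, category_code, and price; paths and column names are assumptions, not the project's actual schema.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ecommerce-etl").getOrCreate()

# Assumed input: event files exported from the streaming stage, with a header row.
events = spark.read.csv("events/*.csv", header=True, inferSchema=True)

# Example transformation: per-category engagement and price summaries of the
# kind the dashboards consume.
summary = events.groupBy("category_code").agg(
    F.countDistinct("user_id").alias("unique_users"),
    F.avg("price").alias("avg_price"),
    F.sum(F.when(F.col("event_type") == "purchase", 1).otherwise(0)).alias("purchases"),
)

# Write both output formats mentioned above: CSV for Tableau, ORC for storage.
summary.write.mode("overwrite").option("header", True).csv("out/summary_csv")
summary.write.mode("overwrite").orc("out/summary_orc")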

World Energy Consumption Visualized

The 'World Energy Consumption Visualized' project explores the dynamic connection between global energy consumption and economic prosperity through interactive visualizations built with D3.js on Our World in Data's dataset.

D3.js
GitHub Pages
Pandas
EDA
HTML/CSS

Project Overview:

  • Interactive Data Visualization: Developed an interactive platform to explore the relationship between global energy consumption and economic prosperity using a variety of visualizations.

  • Comprehensive Analysis: Analyzed global GDP distribution, energy consumption patterns, and the energy mix to reveal insights into economic and environmental impacts.

 

Visualizations:

  1. World Choropleth Map: Displays the global distribution of GDP by color-coding countries.

  2. Pareto Chart: Illustrates the relationship between a country's GDP and energy consumption over time.

  3. Bubble Chart: Compares fossil fuel consumption, renewable energy consumption, and GDP for the top 5 most populous countries (data preparation sketched after this list).

  4. Stacked Area Chart: Displays the composition of energy production for a selected country over time.
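As one example of the data preparation behind these charts, the bubble-chart inputs can be assembled with pandas from Our World in Data's energy CSV. The file name and column names below follow the published OWID schema but are assumptions about this project's setup, and the chosen year is arbitrary.

import pandas as pd

# Load the OWID energy dataset (assumed to be a local copy of owid-energy-data.csv).
energy = pd.read_csv("owid-energy-data.csv")

# Keep real countries for one year: OWID aggregate rows (World, continents)
# carry non-standard ISO codes, so filter to 3-letter codes.
year = 2018
snapshot = energy[(energy["year"] == year) & energy["iso_code"].str.len().eq(3)]

# Bubble chart inputs: fossil vs. renewable consumption and GDP for the
# 5 most populous countries.
top5 = snapshot.nlargest(5, "population")
print(top5[["country", "gdp", "fossil_fuel_consumption", "renewables_consumption"]])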

Backorder Prediction

I built a Backorder Prediction System using machine learning. It forecasts stock shortages with 90% accuracy and offers user-friendly interfaces for both single-product checks and bulk prediction via CSV files. The system leverages industry-standard tools such as Django and scikit-learn for efficient development and deployment.

Django
Django ORM
Apache Spark
GitLab CI/CD
Model Building
HTML/CSS

Project Overview:

  • Machine Learning Integration: Developed a Django web application to predict product backorder status using pre-trained machine learning models (Random Forest, Decision Tree, LightGBM).

  • Interactive Interface: Provides both a web form for single product prediction and a batch prediction feature via CSV upload.

 

Key Features:

  • Accurate Predictions: Uses trained machine learning models to predict backorders with up to 90% accuracy; per-model performance metrics are reported for comparison.

  • User-Friendly Interface: The application includes a web interface for easy input of product features and an API endpoint for programmatic predictions.
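A minimal sketch of the single-product endpoint, assuming the trained model was serialized with joblib and that the form posts numeric features; the model path and the feature subset shown here are illustrative, not the project's actual artifact or schema.

import joblib
import numpy as np
from django.http import JsonResponse
from django.views.decorators.http import require_POST

# Assumed artifact: a scikit-learn-compatible model saved during training.
model = joblib.load("models/backorder_model.joblib")

# Illustrative feature subset; the real form exposes the full feature set.
FEATURES = ["national_inv", "lead_time", "in_transit_qty", "forecast_3_month"]

@require_POST
def predict(request):
    # Pull each numeric feature from the submitted form and score one row.
    row = np.array([[float(request.POST[name]) for name in FEATURES]])
    return JsonResponse({"backorder": bool(model.predict(row)[0])})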


Technical Skills:

  • Web Development: Django, HTML, CSS, JavaScript

  • Machine Learning: scikit-learn, LightGBM

  • Data Handling: Batch processing of predictions via CSV files
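The batch path can be as simple as scoring an uploaded CSV with pandas and writing a prediction column back out; the model path and feature names repeat the assumptions from the sketch above.

import joblib
import pandas as pd

model = joblib.load("models/backorder_model.joblib")  # assumed artifact
FEATURES = ["national_inv", "lead_time", "in_transit_qty", "forecast_3_month"]  # illustrative

def predict_csv(path_in: str, path_out: str) -> None:
    # Read the uploaded file, score every row, and save the results.
    products = pd.read_csv(path_in)
    products["backorder_prediction"] = model.predict(products[FEATURES])
    products.to_csv(path_out, index=False)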

Person Detection Using Embarrassingly Parallel Computing

This project utilizes parallel processing to improve pedestrian detection in videos. It leverages a pre-trained YOLOv3 model for object detection and harnesses the power of multi-core processors for faster frame processing. The system is designed to be scalable, allowing for efficient processing of large video datasets.

Object Detection
PyTorch
YOLO
Parallel Processing
OpenCV
Linux

Project Overview:

  • Parallel Computing Implementation: Developed a system to detect persons in video frames using parallel computing techniques, achieving significant speedup and efficiency improvements.

  • Deep Learning Integration: Utilized the YOLOv3 model for accurate person detection, demonstrating practical application of advanced deep learning models.

 

Key Features:

  • Performance Comparison: Implemented and compared serial and parallel versions of the detection pipeline, achieving a speedup of up to 2.42x with parallel processing.

  • Efficiency Analysis: Improved efficiency up to 30.30%, demonstrating effective utilization of multi-core processors.
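For context on how these two figures relate: parallel efficiency is speedup divided by worker count, so a 2.42x speedup at 30.30% efficiency implies roughly eight workers (2.42 / 8 ≈ 0.303). The worker count is an inference from the reported numbers, not something stated here.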


Technical Skills:

  • Parallel Computing: Leveraged Python's multiprocessing module to distribute the workload across multiple cores (see the sketch after this list).

  • Deep Learning Models: Integrated the YOLOv3 model for object detection.

  • Computer Vision: Used OpenCV for video frame processing.
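The sketch below shows the shape of the parallel pipeline: frame ranges are farmed out to a process pool, and each worker runs YOLOv3 over its slice. It loads the network through OpenCV's DNN module for brevity (the project lists PyTorch), and the video path, Darknet config/weights file names, and worker count are all assumptions.

import cv2
import numpy as np
from multiprocessing import Pool

VIDEO = "input.mp4"                            # assumed input path
CFG, WEIGHTS = "yolov3.cfg", "yolov3.weights"  # standard Darknet files, assumed present

def count_persons(frame_range):
    # Each worker loads its own network and video handle; neither is picklable.
    net = cv2.dnn.readNetFromDarknet(CFG, WEIGHTS)
    layers = net.getUnconnectedOutLayersNames()
    cap = cv2.VideoCapture(VIDEO)
    start, stop = frame_range
    cap.set(cv2.CAP_PROP_POS_FRAMES, start)
    detections = 0
    for _ in range(start, stop):
        ok, frame = cap.read()
        if not ok:
            break
        blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
        net.setInput(blob)
        for output in net.forward(layers):
            for row in output:
                scores = row[5:]
                # COCO class 0 is "person"; keep confident detections only.
                if np.argmax(scores) == 0 and scores[0] > 0.5:
                    detections += 1
    cap.release()
    return detections

if __name__ == "__main__":
    cap = cv2.VideoCapture(VIDEO)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.release()
    workers = 8  # assumed; see the efficiency note above
    step = -(-total // workers)  # ceiling division so every frame is covered
    chunks = [(i, min(i + step, total)) for i in range(0, total, step)]
    with Pool(workers) as pool:
        print("person detections:", sum(pool.map(count_persons, chunks)))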


Customer Segmentation Clustering

This project uses k-means and agglomerative clustering to segment customers into groups based on their characteristics and purchasing habits. The goal is to understand the similarities and differences between segments, which can inform marketing strategies and the targeting of specific customer groups.

Feature Engineering
K-means
ETL
PCA
Agglomerative Clustering

Project Overview:

  • Utilized k-means and Agglomerative clustering to segment customers based on attributes and purchasing behaviors.

  • Aimed at informing targeted marketing and product customization strategies.

 

Key Features:

  • Data Preprocessing: Managed missing values, engineered features, and encoded categorical variables.

  • Dimensionality Reduction: Applied PCA to streamline clustering analysis.

  • Clustering Analysis: Used the Elbow Method to determine the optimal number of clusters and ran Agglomerative Clustering for comparison (both sketched after this list).

  • Result Interpretation: Analyzed distinct customer segments based on demographics and behaviors.
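A compact sketch of the clustering workflow described above, assuming the preprocessed customer table is already fully numeric; the file name and the final cluster count are illustrative.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, AgglomerativeClustering

# Assumed input: the cleaned, encoded customer table from the preprocessing step.
customers = pd.read_csv("customers_clean.csv")
X = PCA(n_components=3).fit_transform(StandardScaler().fit_transform(customers))

# Elbow Method: watch inertia fall as k grows and pick the bend.
for k in range(2, 11):
    print(k, KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_)

# Fit both clusterings at the chosen k (4 here is illustrative).
customers["kmeans_segment"] = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
customers["agglo_segment"] = AgglomerativeClustering(n_clusters=4).fit_predict(X)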
