Work

Personal
Projects

A collection of data science, ML, and analytics projects — click any card to explore details and interactive demos.

Featured Machine Learning Random Forest LSTM Interactive

NFL Game Prediction Model

Random Forest and RNN models trained on 6,990 games to predict NFL outcomes. Correctly predicted all 2025 Conference Finals and the Super Bowl — documented on GitHub before kickoff.

Jan 2025 — present
3/32025 correct picks
97.6%test accuracy
6,990games trained
26NFL seasons

I built this project to predict NFL game outcomes using over 35 years of open-source data with 40+ attributes. The attributes were a mix of categorical and numerical features that required careful preprocessing. Models built include Random Forests, XGBoost, Neural Networks, Bayesian Networks, and regression models, each evaluated for suitability to this specific classification task.

The core data challenge: missing values were imputed using linear interpolation (e.g. for venue temperature) rather than dropping rows, preserving valuable historical signal. Categorical variables were one-hot encoded; the target (result: home win, away win, OT) was label encoded. Redundant columns like duplicate QB identifiers were dropped to reduce noise.

The prediction function takes real pre-game conditions like team matchup, game type, surface, weather forecast, and more, and outputs a score differential. Positive means home team wins, negative means away team wins. For the Super Bowl, PHI was treated as the home team based on the official scoreboard ordering (KC–PHI).

Results & model comparison

  • Random Forest was the best overall model — peak 71% accuracy on win prediction, most robust in practice
  • Neural Network achieved precision of 0.8 but lower recall; less reliable than RF in playoff conditions
  • After updating with conference finals data, both models converged on Eagles to win the Super Bowl
  • Naive Bayes and other models were intentionally included to demonstrate the cost of poor model selection
  • Weather attributes (temp, wind), location, and game type all improved Random Forest performance meaningfully
Verified predictions — timestamped on GitHub before kickoff
DateMatchupPredictedActualModelResult
Jan 19, 2025 WAS @ PHI — Conf. Finals PHI wins PHI wins Random Forest ✓ correct
Jan 19, 2025 BUF @ KC — Conf. Finals KC wins KC wins Random Forest ✓ correct
Feb 9, 2025 KC @ PHI — Super Bowl LIX PHI wins PHI wins RNN ✓ correct
Live game predictor

checking API...

Select two teams and run a prediction

Predicted winner

AWAY HOME

Team win rates (2019–2024)
Model accuracy by season (1999–2024)
PythonScikit-learnTensorFlow/Keras XGBoostPandasNumPy FastAPIChart.js
Data Analytics Sports Visualization

NBA Player Performance Analysis

Game-level and series-level player analysis across the 2025 NBA Playoffs. Compared Tyrese Haliburton and Shai Gilgeous-Alexander across key performance indicators with injury and burnout assessment.

2025

This analysis focused on the 2025 NBA Finals series between Tyrese Haliburton and Shai Gilgeous-Alexander. Using game-by-game playoff logs, I built visualizations with rolling 3-game averages across key performance indicators to track how each player trended through the series.

The analysis was extended to include regular season data as a baseline, assessing whether either player showed signs of fatigue or performance drop-off that could signal injury risk heading into the Finals.

Most recent playoff game comparison

Most recent playoff game comparison across all key stats — SGA led in points, rebounds, and defensive metrics

Points per game with 3-game rolling average

Points per game with 3-game rolling average — Haliburton's trend declined sharply through the series

Total rebounds per game

Total rebounds — SGA's rebounding trended upward as the series progressed

Steals per game

Steals — SGA peaked at 3 in game 2; both players declined toward the end of the series

Blocks per game

Blocks — SGA consistently contributed defensively; Haliburton near zero throughout

Key findings

  • SGA outperformed Haliburton in points, rebounds, steals, and blocks across nearly all games in the series
  • Haliburton's 3-game rolling average for points declined steadily — from ~20 early to under 10 by the end
  • SGA's rolling average remained elevated throughout, suggesting no late-series fatigue
  • Regular season vs playoff comparison revealed no significant injury-risk signal for either player
PythonPandasMatplotlibPlotlyGoogle Colab
Geospatial Forecasting Policy QGIS

Carbon Emissions Reduction — Mesa, AZ

Forecasted emissions and temperature trends, identified key drivers of declining public transit ridership, and redesigned bus routes impacting 63,000+ residents with an estimated 8,000 tons of annual carbon reduction.

Jan 2025 — May 2025
63k+residents impacted
8,000tonnes CO₂ saved / yr
22k+tonnes if electric buses
$540kestimated infrastructure cost

This project addressed the intersection of climate change, public transit decline, and urban planning in Mesa, AZ. All data was publicly available through the City of Mesa’s website, Valley Metro’s budget reports, and the Open-Meteo weather API.

Temperature forecasting: LSTM and ARIMA models were trained on 10+ years of historical data. The optimal LSTM used 2 layers, 100 units, Adam optimizer, trained for 20 epochs, achieving MSE and RMSE both under 0.38, meaning less than 0.4°C deviation from actuals. The 30-day forecast (shown below) projects continued seasonal decline consistent with winter 2025 trends.

Ridership analysis: Valley Metro data showed ridership collapsed from 283k riders in February 2020 to just 154k in April 2020 and never recovered to pre-pandemic levels. This was attributed to a combination of the pandemic, the shift to remote work, and persistently higher unemployment.

Geospatial work: Using QGIS 3.4, population estimates were joined to census tract shapefiles using the GEOID key to visualize density. Tracts with no service overlap were isolated. For example, the eastern region of Mesa was heavily underserved despite high population density (dark blocks in the map below).

Route design: 5 new bus routes were proposed covering underserved areas with 45 new stops, minimizing redundant coverage. A 500m walkability buffer confirmed over 63,000 residents would gain access within a 5-minute walk.

LSTM temperature forecast for Mesa, AZ

LSTM 30-day average temperature forecast — MSE < 0.38, trained on 10+ years of Open-Meteo data

Bus ridership by route and QGIS service map

Left: monthly bus ridership per route showing post-pandemic decline. Right: QGIS service coverage map

Underserved census tracts in eastern Mesa

Underserved census tracts — dark blocks show population with no transit overlap

Proposed new bus routes

5 proposed new routes (colored lines) covering gaps in eastern Mesa

Zoomed proposed routes with walkability buffers

Zoomed view of proposed routes with 500m walkability buffers and bus stop vertices

Financial & environmental impact

  • Mesa’s transit budget allocated only $2.7M of $158M to personnel, with many pass readers non-functional
  • 45 new bus stops at $12k each totals $540k — less than 0.15% reallocation of the city’s foreign aid budget
  • Assuming 15% adoption among affected residents: ~8,000 tonnes CO₂ saved per year (diesel buses)
  • With electric buses: over 22,000 tonnes saved — equivalent to eliminating 20M miles of gas car travel or planting 400k trees
PythonQGIS 3.4TensorFlow / LSTMARIMA PandasMatplotlibOpen-Meteo APIMicrosoft Excel
Big Data SparkSQL Geospatial

Urban Mobility Hotspot Analysis

Large-scale spatial queries and hotspot analysis on high-volume transit datasets using SparkSQL. Identified statistically significant passenger density clusters across urban transit networks.

Oct 2024 — Dec 2024

This project applied big data techniques to urban transit analysis — specifically using SparkSQL range joins to efficiently query millions of passenger records and identify spatial hotspots where demand was highest relative to supply.

The core challenge was performance: naive spatial joins on large datasets are prohibitively slow. Using Spark's distributed compute and optimized range join strategies, I brought query times down to a practical level while maintaining statistical rigor in the hotspot identification.

Key findings

  • Identified statistically significant hotspots using Getis-Ord Gi* spatial statistics
  • Range joins on datasets of millions of records completed in minutes on distributed compute
  • Hotspot clusters showed strong correlation with employment centers and transit transfer points
SparkSQLApache SparkPythonPySparkGeospatial analysis