Here are some projects I’ve worked on and learned a great deal from.
Climbing board data analysis and modelling (TB2 | Kilter)
I analyzed climbing board data to understand and predict route difficulty using a full data science workflow on large-scale datasets (~130,000 Tension Board climbs, including ~40,000 on TB2, and ~300,000 Kilter Board climbs). The project was structured into two main components:
- exploratory data analysis using SQL, statistical methods, and visualization, and
- predictive modelling via feature engineering and machine learning.
Across both datasets, we observed consistent patterns despite the near absence of user-level data (limited to first-ascent timestamps and ascent counts). The analysis phase explored grade distributions, angle effects, temporal trends, and spatial structure. A key component was hold-level analysis: we constructed usage heatmaps and estimated hold difficulty by aggregating climb data and applying Bayesian smoothing to stabilize estimates for infrequently used holds.
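As a rough illustration of the smoothing idea, here is a minimal sketch that shrinks each hold's mean grade toward the global mean using a pseudo-count. This is one common form of Bayesian smoothing and not necessarily the exact prior used in the project; the function name, the `prior_strength` value, and the toy numbers are all assumptions.

```python
import numpy as np

def smoothed_hold_difficulty(grade_sums, counts, prior_strength=20.0):
    """Shrink per-hold mean grades toward the global mean.

    grade_sums[i]  : sum of the grades of all climbs that use hold i
    counts[i]      : number of climbs that use hold i
    prior_strength : pseudo-count m (hypothetical value); larger m
                     pulls rarely used holds harder toward the mean
    """
    grade_sums = np.asarray(grade_sums, dtype=float)
    counts = np.asarray(counts, dtype=float)
    global_mean = grade_sums.sum() / counts.sum()
    # Posterior-mean style estimate: (sum + m * mu) / (n + m)
    return (grade_sums + prior_strength * global_mean) / (counts + prior_strength)

# Toy data: hold 0 appears in 2 hard climbs, hold 1 in 500 moderate ones.
counts = np.array([2, 500])
grade_sums = np.array([2 * 8.0, 500 * 4.0])
raw_means = grade_sums / counts
smoothed = smoothed_hold_difficulty(grade_sums, counts)
```

The rarely used hold is pulled strongly toward the global mean, while the frequently used hold barely moves.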
In the modelling phase, we compared linear models, Random Forests, and neural networks. A central challenge was the lack of information about hold types (e.g., crimp, sloper, jug) and, of course, the lack of beta, so the models had to rely purely on spatial structure. Initial models incorporated hold difficulty estimates but suffered from target leakage, since those estimates were themselves derived from the grades being predicted. After removing these features, we obtained strong and consistent results using only geometric features of climbs, achieving ~70% accuracy within ±1 V-grade and ~90% within ±2 grades. This suggests that a substantial portion of climbing difficulty can be explained by spatial configuration and movement constraints alone.
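For reference, the "within ±k grades" metric can be computed as below. This is a minimal sketch assuming grades are encoded as integers and predictions are rounded to the nearest grade; the function name and the toy values are illustrative, not taken from the project.

```python
import numpy as np

def within_k_accuracy(y_true, y_pred, k=1):
    """Fraction of predictions within k grades of the true grade.

    Assumes integer-coded grades; continuous predictions are
    rounded to the nearest grade before comparison.
    """
    y_true = np.asarray(y_true)
    y_pred = np.rint(np.asarray(y_pred, dtype=float))
    return float(np.mean(np.abs(y_pred - y_true) <= k))

# Toy example (made-up numbers, not project results)
y_true = [3, 5, 7, 4]
y_pred = [3.4, 6.6, 7.0, 1.2]
acc_1 = within_k_accuracy(y_true, y_pred, k=1)
acc_2 = within_k_accuracy(y_true, y_pred, k=2)
```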
Data science for the linear algebraist (gitlab)
Here the goal is to bridge theoretical linear algebra with practical data science: we show how concepts such as least-squares regression, matrix decompositions, and spectral theory underpin modern machine learning methods. Aimed at readers with a strong linear algebra background, the project translates standard data science terminology into linear algebraic language and develops models from first principles. Using Python (NumPy, pandas, matplotlib), we formulate regression as a matrix problem, emphasizing the geometric interpretation: least-squares regression amounts to orthogonally projecting the response vector onto the column space of the design matrix.
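The projection picture can be checked numerically. The sketch below, on synthetic data chosen purely for illustration, fits a least-squares line with NumPy and verifies that the residual is orthogonal to the column space of the design matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Design matrix with an intercept column and one feature (synthetic data).
x = rng.normal(size=50)
X = np.column_stack([np.ones_like(x), x])
y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=50)

# Least-squares coefficients: minimize ||X @ beta - y||_2
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Fitted values are the orthogonal projection of y onto col(X).
y_hat = X @ beta
residual = y - y_hat

# Same projection via an orthonormal basis Q of col(X): P = Q Q^T.
Q, _ = np.linalg.qr(X)
y_proj = Q @ (Q.T @ y)
```

The residual satisfies `X.T @ residual ≈ 0`, which is exactly the statement that it is orthogonal to every column of `X`.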
Beyond basic regression, we explore numerical methods including QR decomposition and the singular value decomposition (SVD), highlighting why naive approaches, such as solving the normal equations directly, are numerically inferior in practice: forming the Gram matrix squares the condition number of the problem. The project concludes with an application to image denoising via truncated SVD, illustrating how spectral methods and low-rank approximations capture structure while effectively removing noise.
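Both points can be illustrated in a few lines. The sketch below uses synthetic data (a polynomial design matrix and a rank-1 "image"), which are assumptions made for the example and not the datasets used in the project. It compares the conditioning of the normal equations with that of the original matrix, solves the least-squares problem via QR, and denoises with a truncated SVD.

```python
import numpy as np

# --- Part 1: why the normal equations are ill-conditioned -------------
t = np.linspace(0.0, 1.0, 50)
A = np.vander(t, 8, increasing=True)      # polynomial design matrix
cond_A = np.linalg.cond(A)
cond_gram = np.linalg.cond(A.T @ A)       # roughly cond_A ** 2

# Least squares via reduced QR: A = Q R, then solve R beta = Q^T b.
true_coef = np.arange(1.0, 9.0)
b = A @ true_coef
Q, R = np.linalg.qr(A)
# np.linalg.solve is used for brevity; a dedicated triangular solver
# would exploit the structure of R.
beta_qr = np.linalg.solve(R, Q.T @ b)

# --- Part 2: denoising with a truncated SVD ---------------------------
def truncated_svd(M, rank):
    """Best rank-`rank` approximation of M in the Frobenius norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 3.0, 64)
clean = np.outer(np.sin(grid), np.cos(grid))   # rank-1 synthetic "image"
noisy = clean + rng.normal(scale=0.1, size=clean.shape)
denoised = truncated_svd(noisy, rank=1)

err_noisy = np.linalg.norm(noisy - clean)
err_denoised = np.linalg.norm(denoised - clean)
```

For real images the rank is not known in advance; a common heuristic is to inspect the decay of the singular values and truncate where they level off.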
Anki tools for language learning (gitlab)
This is a modular toolkit designed to enhance Anki-based language learning on Linux systems. It provides a collection of scripts for
- extracting audio,
- generating text-to-speech content,
- sentence mining, and
- processing YouTube transcripts,
all integrated with AnkiConnect.
The goal is automation and scalability: audio can be extracted and concatenated into playlists for passive listening, sentence lists can be converted into fully voiced Anki cards, and decks can be analyzed to produce frequency-ranked vocabulary using NLP tools such as spaCy and MeCab. The system is designed to be easily extendable to additional languages, and includes utilities for extracting and processing subtitle data into either vocabulary lists or timestamped sentences.
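To give a flavour of two of these steps, here is a minimal sketch of frequency ranking and of building an AnkiConnect `addNote` request. The whitespace tokenizer is only a stand-in for spaCy or MeCab, and the function names, the tag, and the deck and note type ("Default", "Basic" with `Front`/`Back` fields) are assumptions for the example, not the toolkit's actual interface. The request is only constructed here, not sent; AnkiConnect would normally receive it as a JSON POST on its local HTTP port.

```python
import json
from collections import Counter

def rank_vocabulary(sentences, tokenize=str.split):
    """Return (token, count) pairs sorted by descending frequency.

    `tokenize` is a placeholder; a real pipeline would plug in a
    language-specific tokenizer or lemmatizer here.
    """
    counts = Counter(tok.lower() for s in sentences for tok in tokenize(s))
    return counts.most_common()

def add_note_request(front, back, deck="Default", model="Basic"):
    """Build the JSON body for an AnkiConnect `addNote` call (API version 6)."""
    return {
        "action": "addNote",
        "version": 6,
        "params": {
            "note": {
                "deckName": deck,          # assumed deck name
                "modelName": model,        # assumed note type
                "fields": {"Front": front, "Back": back},
                "tags": ["sentence-mining"],  # hypothetical tag
            }
        },
    }

sentences = ["the cat sleeps", "the dog sleeps", "the cat eats"]
ranking = rank_vocabulary(sentences)
payload = json.dumps(add_note_request("the cat sleeps", "(translation here)"))
```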