Here are some projects I’ve worked on and learned a great deal from.
Climbing board data analysis and modelling (TB2 | Kilter)
I analyzed climbing board data to understand and predict route difficulty using a full data science workflow on large-scale datasets (~130,000 Tension Board climbs, including ~40,000 on TB2, and ~300,000 Kilter Board climbs). The project was structured into two main components:
- exploratory data analysis using SQL, statistical methods, and visualization, and
- predictive modelling via feature engineering and machine learning.
Across both datasets, we observed consistent patterns despite the near absence of user-level data (limited to first-ascent timestamps and ascent counts). The analysis phase explored grade distributions, angle effects, temporal trends, and spatial structure. A key component was hold-level analysis: we constructed usage heatmaps and estimated hold difficulty by aggregating climb data and applying Bayesian smoothing to stabilize estimates for infrequently used holds.
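The smoothing step can be sketched with a toy version of the estimator: a hypothetical set of (hold, grade) pairs and an empirical-Bayes shrinkage toward the global mean grade, where `k` is an assumed pseudo-count. The real project's data and smoothing details may differ; this only illustrates the idea.

```python
from collections import defaultdict

# Hypothetical toy data: (hold_id, climb_grade) pairs aggregated from climbs.
# In the real project these come from hundreds of thousands of ascents.
usages = [
    ("A1", 5), ("A1", 6), ("A1", 5), ("A1", 7), ("A1", 6),
    ("B2", 9),            # rarely used hold: its raw estimate is unreliable
]

def smoothed_hold_difficulty(usages, k=5.0):
    """Empirical-Bayes shrinkage: pull per-hold mean grades toward the
    global mean, with strength inversely proportional to usage count."""
    grades = defaultdict(list)
    for hold, grade in usages:
        grades[hold].append(grade)
    global_mean = sum(g for _, g in usages) / len(usages)
    return {
        hold: (sum(gs) + k * global_mean) / (len(gs) + k)
        for hold, gs in grades.items()
    }

est = smoothed_hold_difficulty(usages)
# The rare hold "B2" is pulled strongly toward the global mean,
# while the well-sampled "A1" stays close to its raw average.
```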
In the modelling phase, we compared linear models, Random Forests, and neural networks. A central challenge was the lack of structural information about holds (e.g., crimp, sloper, jug) and, of course, the lack of beta, which forced the models to rely purely on spatial structure. Initial models incorporated hold difficulty estimates but suffered from target leakage. After removing these features, we obtained strong and consistent results using only geometric features of the climbs, achieving ~70% accuracy within ±1 V-grade and ~90% within ±2 grades. This suggests that a substantial portion of climbing difficulty can be explained by spatial configuration and movement constraints alone.
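A minimal sketch of this kind of setup, using scikit-learn on synthetic data. The feature names (board angle, hold count, mean spacing) are illustrative stand-ins for the project's actual geometric features, and the grade-generating formula is invented purely for the demo.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500

# Hypothetical geometric features (stand-ins for the real feature set).
angle = rng.uniform(20, 70, n)        # board angle in degrees
n_holds = rng.integers(5, 15, n)      # number of holds on the climb
spacing = rng.uniform(0.3, 1.2, n)    # mean inter-hold distance
X = np.column_stack([angle, n_holds, spacing])

# Synthetic V-grade: steeper, sparser climbs with fewer holds are harder.
grade = np.round(angle / 15 + 3 * spacing - 0.2 * n_holds
                 + rng.normal(0, 0.5, n)).clip(0, 12)

X_tr, X_te, y_tr, y_te = train_test_split(X, grade, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
pred = np.round(model.predict(X_te))

# The evaluation metric from the write-up: accuracy within ±1 V-grade.
within_1 = np.mean(np.abs(pred - y_te) <= 1)
```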
Data science for the linear algebraist
Here the goal is to bridge theoretical linear algebra with practical data science: we show how concepts such as least-squares regression, matrix decompositions, and spectral theory underpin modern machine learning methods. Aimed at readers with a strong linear algebra background, the project translates standard data science terminology into linear algebraic language and develops models from first principles. Using Python (NumPy, pandas, matplotlib), we formulate regression as a matrix problem, emphasizing the geometric interpretation: regression amounts to projecting onto the column space. We then discuss QR decompositions and the SVD, and highlight numerical pitfalls such as the ill-conditioning that arises when forming the normal equations.
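The projection view can be made concrete in a few lines of NumPy. This is a generic least-squares sketch, not code from the project: it solves the same problem via the normal equations and via QR, and checks that the residual is orthogonal to the column space.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                       # design matrix
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, 50)

# Normal equations: solve X^T X b = X^T y (fine here, but squares the
# condition number of X).
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# QR route: numerically preferred, never forms X^T X explicitly.
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)

# Geometry: the fitted values X @ beta are the orthogonal projection of y
# onto col(X), so the residual is orthogonal to every column of X.
residual = y - X @ beta_qr
orth = np.abs(X.T @ residual).max()                # ~0 up to rounding
```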
Principal component analysis (PCA) follows naturally as an application of low-rank approximation via truncated SVD, and we illustrate the spectral approach concretely through image denoising: truncating small singular values captures structure while removing noise.
Finally, we broaden the perspective to practical modelling. We cover train/test splits and cross-validation; Ridge and Lasso regularization, connecting the $L^2$ and $L^1$ penalties back to condition numbers and the geometry of norm balls; gradient descent and how its convergence depends on the spectrum of $X^TX$; nonlinear models via decision trees and random forests; and logistic regression for classification. Feature scaling, hyperparameter tuning, and model interpretation are woven throughout.
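The spectrum-dependence of gradient descent can be sketched on a least-squares objective, with the step size chosen from the largest eigenvalue of $X^TX$. This is a generic illustration, not project code.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
beta_true = rng.normal(size=5)
y = X @ beta_true                     # noiseless, so the minimizer is beta_true

# Gradient of (1/2)||y - X b||^2 is X^T(X b - y); the iteration contracts
# the error by factors |1 - step * lambda_i| over the eigenvalues of X^T X,
# so the spread of the spectrum controls the convergence rate.
H = X.T @ X
lam_max = np.linalg.eigvalsh(H).max()
step = 1.0 / lam_max                  # stable step size from the spectrum

b = np.zeros(5)
for _ in range(2000):
    b -= step * (H @ b - X.T @ y)
```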
Anki tools for language learning
This is a modular toolkit designed to enhance Anki-based language learning on Linux systems. It provides a collection of scripts for
- extracting audio,
- generating text-to-speech content,
- sentence mining, and
- processing YouTube transcripts,
all integrated with AnkiConnect.
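As an illustration of the AnkiConnect integration, here is a sketch that builds an `addNote` request in AnkiConnect's JSON format (an `action`/`version`/`params` object posted to the local server, normally at `http://localhost:8765`). The deck name, fields, and tag are placeholders, and the actual HTTP call is left as a comment.

```python
import json

def add_note_request(deck, front, back, audio_url=None):
    """Build an AnkiConnect 'addNote' request body.
    Deck/model/field names below are placeholders."""
    note = {
        "deckName": deck,
        "modelName": "Basic",
        "fields": {"Front": front, "Back": back},
        "options": {"allowDuplicate": False},
        "tags": ["auto-generated"],
    }
    if audio_url:
        # Attach generated TTS audio to the card's Back field.
        note["audio"] = [{"url": audio_url, "filename": "tts.mp3",
                          "fields": ["Back"]}]
    return {"action": "addNote", "version": 6, "params": {"note": note}}

payload = add_note_request("Japanese::Mining", "猫", "cat")
body = json.dumps(payload, ensure_ascii=False)
# POST `body` to the AnkiConnect endpoint with any HTTP client, e.g.:
# urllib.request.urlopen("http://localhost:8765", body.encode("utf-8"))
```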
The goal is automation and scalability: audio can be extracted and concatenated into playlists for passive listening, sentence lists can be converted into fully voiced Anki cards, and decks can be analyzed to produce frequency-ranked vocabulary using NLP tools such as spaCy and MeCab. The system is designed to be easily extendable to additional languages, and includes utilities for extracting and processing subtitle data into either vocabulary lists or timestamped sentences.
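The frequency-ranking step can be sketched as follows. A naive regex tokenizer stands in for the spaCy/MeCab lemmatization the toolkit uses for real language data, so treat this as a simplified illustration of the pipeline shape rather than the actual implementation.

```python
import re
from collections import Counter

def frequency_rank(cards):
    """Rank tokens across card texts by frequency, most common first.
    A regex tokenizer stands in for proper lemmatization here."""
    counts = Counter()
    for text in cards:
        counts.update(re.findall(r"[a-zA-Z']+", text.lower()))
    return counts.most_common()

cards = ["the cat sat on the mat", "the dog chased the cat"]
ranking = frequency_rank(cards)
# ranking[0] is the most frequent token across the toy deck
```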