Churn and tenure prediction on 624k users

ROLE: Solo build
TIMEFRAME: 2025
STACK: Python, LightGBM, CatBoost, XGBoost, scikit-learn
LINKS: github ↗

0.88 AUC

0.83 RECALL ON TEST

2.31 TENURE MAE

0.939 TENURE R²

The problem

Predict which users will churn early enough to act on it, using 90 days of anonymized usage logs (624,048 training rows), without letting future information leak into the features.

Approach

Features come only from days 1–60, so the model never sees the window it is judged on. Engineered signals include rolling EMAs and volatilities over 7-, 14-, and 30-day windows, a usage slope, a drop trend, the coefficient of variation, and a count of zero-activity days. A random forest regressor first predicts expected tenure (test MAE 2.31, R² 0.939); its prediction, never the actual tenure, then becomes a feature for the churn classifier. The final classifier is a voting ensemble over random forest, LightGBM, CatBoost, XGBoost, and logistic regression.
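The pipeline above can be sketched end to end on synthetic data. This is a minimal illustration, not the project's code, and it rests on several assumptions: the usage matrix is simulated, the `drop_trend` formula is a hypothetical definition, the tenure feature is built from out-of-fold predictions (one common way to keep the actual tenure out of the classifier), the ensemble uses soft voting, and scikit-learn's `GradientBoostingClassifier` stands in for LightGBM, CatBoost, and XGBoost so the block runs with scikit-learn alone.

```python
import numpy as np
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    RandomForestRegressor,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

CUTOFF = 60  # only days 1-60 may feed the features


def ema(x, span):
    """Final value of an exponential moving average with the given span."""
    alpha = 2.0 / (span + 1.0)
    value = x[0]
    for v in x[1:]:
        value = alpha * v + (1.0 - alpha) * value
    return value


def user_features(usage):
    """Ten features for one user, computed from days 1-60 only."""
    window = np.asarray(usage, dtype=float)[:CUTOFF]
    mean = window.mean()
    feats = [ema(window, s) for s in (7, 14, 30)]            # EMAs
    feats += [window[-s:].std() for s in (7, 14, 30)]        # volatilities
    feats.append(np.polyfit(np.arange(CUTOFF), window, 1)[0])  # usage slope
    # Hypothetical drop_trend: early-window mean minus late-window mean.
    feats.append(window[:14].mean() - window[-14:].mean())
    feats.append(window.std() / mean if mean > 0 else 0.0)   # coeff. of variation
    feats.append(float((window == 0).sum()))                 # zero-activity days
    return feats


# Synthetic stand-in for the usage logs: 400 users x 90 days (assumption).
rng = np.random.default_rng(0)
n_users = 400
days = np.arange(90)
stays = rng.random(n_users) < 0.5
tenure = np.where(stays, 90, rng.integers(30, 90, size=n_users))
# Churners' activity fades over the 30 days before they leave.
rate = 5.0 * np.clip((tenure[:, None] - days[None, :]) / 30.0, 0.0, 1.0)
rate[stays] = 5.0
usage = rng.poisson(rate).astype(float)
churned = (~stays).astype(int)

X = np.array([user_features(u) for u in usage])
X_tr, X_te, t_tr, t_te, y_tr, y_te = train_test_split(
    X, tenure, churned, test_size=0.25, random_state=0, stratify=churned
)

# Stage 1: tenure regressor. Training rows get out-of-fold predictions so the
# classifier never sees a prediction made by a model that saw that row's tenure.
oof_tenure = cross_val_predict(
    RandomForestRegressor(n_estimators=100, random_state=0), X_tr, t_tr, cv=5
)
tenure_model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, t_tr)
X_tr_s = np.column_stack([X_tr, oof_tenure])
X_te_s = np.column_stack([X_te, tenure_model.predict(X_te)])

# Stage 2: voting ensemble over heterogeneous classifiers.
ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),  # boosted-tree stand-in
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ],
    voting="soft",
)
ensemble.fit(X_tr_s, y_tr)
churn_prob = ensemble.predict_proba(X_te_s)[:, 1]
```

The key design point is that the eleventh column is always a model output: out-of-fold on the training rows, and a fitted model's prediction on the test rows. Slicing each usage vector to `CUTOFF` before any statistic is computed is what keeps days 61–90 out of the features.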

Results

Test set: AUC-ROC 0.88, recall 0.83, accuracy 0.81. Recall is the metric that matters here: a missed churner costs more than a retention nudge sent to a happy user. The strongest features were the engineered trend signals, led by drop_trend, rather than the raw usage counts.
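To make the recall-versus-accuracy trade-off concrete, here is a small sketch on toy data. The labels, scores, and thresholds are invented for illustration and are not the project's outputs; the point is only that lowering the decision threshold catches more churners at the price of more false alarms.

```python
def recall_and_accuracy(y_true, scores, threshold):
    """Recall on the positive (churn) class and overall accuracy at a threshold."""
    pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, pred) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(y_true, pred) if y == 1 and p == 0)
    correct = sum(1 for y, p in zip(y_true, pred) if y == p)
    return tp / (tp + fn), correct / len(y_true)


# Toy example: 4 churners (1) and 6 retained users (0), hypothetical scores.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.45, 0.35, 0.6, 0.4, 0.32, 0.31, 0.1, 0.05]

strict = recall_and_accuracy(y_true, scores, threshold=0.5)  # (0.5, 0.7)
loose = recall_and_accuracy(y_true, scores, threshold=0.3)   # (1.0, 0.6)
```

On this toy set, moving the threshold from 0.5 to 0.3 doubles recall while accuracy drops by a tenth, which is an acceptable trade when a missed churner is the costlier error.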

What broke

[Which leakage paths you caught late, what didn't survive — e.g. did SMOTE/threshold tuning help or hurt?]