Churn and tenure prediction on 624k users
Which of 624,000 users will leave — and when
- ROLE: Solo build
- TIMEFRAME: 2025
- STACK: Python, LightGBM, CatBoost, XGBoost, scikit-learn
- LINKS: github ↗
0.88 AUC
0.83 RECALL ON TEST
2.31 Tenure MAE
0.939 Tenure R²
The problem
Predict which users will churn, early enough to act on the prediction, from 90 days of anonymized usage logs (624,048 training rows) and without letting the future leak into the features.
Approach
Only days 1–60 feed the features, so the model never sees the window it's judged on. Engineered signals include rolling EMAs and volatilities (7/14/30-day), a usage slope, drop trend, coefficient of variation, and zero-activity days. A random forest regressor first predicts expected tenure (test MAE 2.31, R² 0.939); its prediction, never the actual tenure, then becomes a feature for the churn classifier. The final classifier is a voting ensemble over random forest, LightGBM, CatBoost, XGBoost, and logistic regression.
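A minimal sketch of how the day-60 cutoff and the trend signals could look for a single user, using NumPy only. This is not the project's code: the function names, the dictionary keys, and in particular the `drop_trend` formula (relative change from the first week to the last week) are assumptions made for illustration, and the synthetic series is made up.

```python
import numpy as np

FEATURE_DAYS = 60  # only days 1-60 are visible to the features


def ema(values, span):
    """Final value of an exponential moving average with alpha = 2 / (span + 1)."""
    alpha = 2.0 / (span + 1)
    level = float(values[0])
    for v in values[1:]:
        level = alpha * float(v) + (1.0 - alpha) * level
    return level


def usage_features(daily_usage):
    """Build trend features for one user from a 90-day usage series."""
    # Hard cutoff: slice before computing anything, so days 61-90 cannot leak in.
    window = np.asarray(daily_usage, dtype=float)[:FEATURE_DAYS]
    days = np.arange(len(window))
    mean = window.mean()

    feats = {}
    for span in (7, 14, 30):
        feats[f"ema_{span}"] = ema(window, span)
        # Volatility here means the standard deviation of the last `span` days.
        feats[f"vol_{span}"] = float(window[-span:].std())
    # Least-squares slope of usage against the day index.
    feats["usage_slope"] = float(np.polyfit(days, window, 1)[0])
    # Assumed definition: relative change from the first week to the last week.
    first, last = window[:7].mean(), window[-7:].mean()
    feats["drop_trend"] = float((last - first) / first) if first > 0 else 0.0
    feats["coef_variation"] = float(window.std() / mean) if mean > 0 else 0.0
    feats["zero_days"] = float((window == 0).sum())
    return feats


# Synthetic user: steady usage of 10 for 40 days, then nothing.
series = np.concatenate([np.full(40, 10.0), np.zeros(50)])
features = usage_features(series)
```

Slicing to the first 60 days at the top of the function is the design choice that matters: whatever happens in days 61–90 cannot change any feature, which is easy to check by overwriting that tail and comparing the output.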
Results
Test set: AUC-ROC 0.88, recall 0.83 at 0.81 accuracy. Recall is the metric that matters here, because a missed churner costs more than a retention nudge sent to a happy user. The strongest features were the engineered trend signals, led by drop_trend, not the raw usage counts.
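The recall-over-accuracy argument can be made concrete with a toy cost calculation. The sketch below is illustrative only: the labels and scores are made up, and the 10:1 ratio between a missed churner and a wasted nudge is an assumed figure, not one measured in this project.

```python
import numpy as np


def recall_and_accuracy(y_true, scores, threshold):
    """Recall on churners and overall accuracy when flagging scores >= threshold."""
    y_true = np.asarray(y_true, dtype=bool)
    flagged = np.asarray(scores) >= threshold
    true_pos = np.sum(flagged & y_true)
    recall = true_pos / y_true.sum()
    accuracy = np.mean(flagged == y_true)
    return float(recall), float(accuracy)


def expected_cost(y_true, scores, threshold, miss_cost=10.0, nudge_cost=1.0):
    """Cost of missed churners (false negatives) plus wasted nudges (false positives)."""
    y_true = np.asarray(y_true, dtype=bool)
    flagged = np.asarray(scores) >= threshold
    misses = np.sum(~flagged & y_true)
    wasted = np.sum(flagged & ~y_true)
    return float(miss_cost * misses + nudge_cost * wasted)


# Toy labels (1 = churned) and made-up model scores, not the project's data.
y = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
p = np.array([0.9, 0.7, 0.45, 0.35, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05])

thresholds = [0.5, 0.3]
costs = {t: expected_cost(y, p, t) for t in thresholds}
best = min(costs, key=costs.get)  # threshold with the lowest total cost
```

On this toy data both thresholds give the same accuracy (0.7), but lowering the threshold from 0.5 to 0.3 lifts recall from 0.5 to 1.0 and cuts the assumed cost from 21 to 3. Accuracy alone would not distinguish the two operating points.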
What broke
[Which leakage paths you caught late, what didn't survive — e.g. did SMOTE/threshold tuning help or hurt?]