The Cold Start Problem
The trained XGBoost model expects 166 features per transaction, but live mempool transactions expose only three: vsize, fee, and fee_rate. Zero-padding the missing 163 features produces meaningless scores, so the system starts with a heuristic scorer and learns from analyst feedback to build a production model on the features it actually has.
Cascade Scoring: Honest Degradation
The scoring service tries three models in order: (1) the River online model if it has enough labels, (2) a batch model retrained on accumulated feedback, (3) a heuristic based on fee rate, transaction size, and fee disproportion. Each level is honest about what it is — the UI badges transactions as 'ML' or 'HEURISTIC' so analysts know the confidence level.
score_transaction(vsize, fee, fee_rate):
    1. river_predict_one()      # online model (if labels exist)
    2. learned_model.predict()  # batch feedback model
    3. heuristic_risk_score()   # sigmoid-based fallback
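A minimal sketch of the cascade in Python. The function names, weights, and sigmoid midpoints below are illustrative assumptions, not the actual service code:

```python
import math

def heuristic_risk_score(vsize, fee, fee_rate):
    """Sigmoid-based fallback combining fee rate and fee disproportion.
    Midpoints (50 sat/vB, 100 sat/byte disproportion) are illustrative."""
    disproportion = fee / max(vsize, 1)          # fee large relative to size
    s1 = 1 / (1 + math.exp(-(fee_rate - 50) / 25))        # extreme fee rate
    s2 = 1 / (1 + math.exp(-(disproportion - 100) / 50))  # fee >> size
    return 0.5 * s1 + 0.5 * s2                   # blend into a [0, 1] score

def score_transaction(tx, river_model=None, batch_model=None,
                      labels_seen=0, min_labels=20):
    """Cascade: online model -> batch model -> heuristic, tagging the source
    so the UI can badge the result as 'ML' or 'HEURISTIC'."""
    x = {"vsize": tx["vsize"], "fee": tx["fee"], "fee_rate": tx["fee_rate"]}
    if river_model is not None and labels_seen >= min_labels:
        return river_model.predict_proba_one(x).get(True, 0.0), "ML"
    if batch_model is not None:
        return float(batch_model.predict([list(x.values())])[0]), "ML"
    return heuristic_risk_score(**x), "HEURISTIC"

# With no models yet trained, the heuristic answers
score, source = score_transaction({"vsize": 250, "fee": 50_000, "fee_rate": 200})
```

With neither learned model available, the cascade falls through to the heuristic and says so, which is the honest-degradation behavior described above.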
River: Incremental Learning from Analyst Feedback
River is an online ML library — models learn one sample at a time. When an analyst marks a flagged transaction as true positive or false positive, that label is immediately fed to the River model (StandardScaler | LogisticRegression). No batch retraining needed. The model improves with every review, creating a tight human-in-the-loop feedback cycle.
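In River itself the pipeline is `preprocessing.StandardScaler() | linear_model.LogisticRegression()`, updated via `learn_one` and queried via `predict_proba_one`. The standalone sketch below mimics that interface without the library, to show what one incremental update actually does; the learning rate and feature names are illustrative:

```python
import math

class OnlineScaledLogReg:
    """Sketch of what a River StandardScaler | LogisticRegression pipeline does
    per sample: update running mean/variance (Welford), standardize, then take
    one SGD step on logistic loss. Not the River implementation itself."""

    def __init__(self, lr=0.05):
        self.lr = lr
        self.n = 0
        self.mean = {}   # running per-feature mean
        self.m2 = {}     # running sum of squared deviations (Welford)
        self.w = {}      # weights
        self.b = 0.0     # intercept

    def _scale(self, x):
        out = {}
        for k, v in x.items():
            mu = self.mean.get(k, 0.0)
            var = self.m2.get(k, 0.0) / max(self.n, 1)
            out[k] = (v - mu) / math.sqrt(var) if var > 0 else 0.0
        return out

    def predict_proba_one(self, x):
        z = self.b + sum(self.w.get(k, 0.0) * v for k, v in self._scale(x).items())
        p = 1 / (1 + math.exp(-z))
        return {True: p, False: 1 - p}

    def learn_one(self, x, y):
        # Welford update of per-feature mean/variance
        self.n += 1
        for k, v in x.items():
            d = v - self.mean.get(k, 0.0)
            self.mean[k] = self.mean.get(k, 0.0) + d / self.n
            self.m2[k] = self.m2.get(k, 0.0) + d * (v - self.mean[k])
        # One SGD step on logistic loss
        err = self.predict_proba_one(x)[True] - (1.0 if y else 0.0)
        for k, v in self._scale(x).items():
            self.w[k] = self.w.get(k, 0.0) - self.lr * err * v
        self.b -= self.lr * err

model = OnlineScaledLogReg()
# Analyst feedback arrives one label at a time, no batch retraining
for _ in range(200):
    model.learn_one({"fee_rate": 300.0, "fee": 60_000.0, "vsize": 200.0}, True)
    model.learn_one({"fee_rate": 2.0, "fee": 300.0, "vsize": 150.0}, False)
```

Every `learn_one` call costs microseconds, which is why a label from an analyst review can update the model immediately.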
Model Racing: Let Them Compete
Three online classifiers race in parallel: Logistic Regression, Hoeffding Tree, and Gaussian Naive Bayes. Each receives every label and makes predictions. Rolling F1 scores track performance. The system uses F1-weighted ensemble voting — the best-performing model gets the most influence. This answers 'which algorithm works best for this data?' empirically, not theoretically.
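A sketch of the scoring side of the race, assuming rolling F1 over a fixed window and linear F1-weighted averaging of the models' probabilities (window size and the fallback-to-plain-average rule are assumptions):

```python
from collections import deque

class RollingF1:
    """Rolling F1 over the last `window` labeled predictions."""
    def __init__(self, window=500):
        self.pairs = deque(maxlen=window)   # (y_true, y_pred)

    def update(self, y_true, y_pred):
        self.pairs.append((y_true, y_pred))

    def get(self):
        tp = sum(1 for t, p in self.pairs if t and p)
        fp = sum(1 for t, p in self.pairs if not t and p)
        fn = sum(1 for t, p in self.pairs if t and not p)
        denom = 2 * tp + fp + fn
        return 2 * tp / denom if denom else 0.0

def ensemble_score(probas, f1s):
    """F1-weighted vote: each model's probability is weighted by its rolling F1.
    Falls back to a plain average when no model has earned any F1 yet."""
    total = sum(f1s)
    if total == 0:
        return sum(probas) / len(probas)
    return sum(p * w for p, w in zip(probas, f1s)) / total

# Current probabilities and rolling F1 scores for the three racers (illustrative)
probas = [0.9, 0.4, 0.7]   # logistic regression, Hoeffding tree, naive Bayes
f1s = [0.8, 0.2, 0.5]
score = ensemble_score(probas, f1s)
```

The best-performing model dominates the vote, but a weaker model that starts winning on new data gains influence automatically as its rolling F1 climbs.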
Drift Detection: ADWIN + PSI
ADWIN (Adaptive Windowing) detects change points in the prediction score stream — it dynamically grows and shrinks a window to identify when the scoring distribution shifts. PSI (Population Stability Index) compares the current score distribution against a baseline. Together, they catch both sudden regime changes and gradual drift.
Drift monitor (ring buffer of 1000 scores):
    ADWIN -> change-point detection (sudden shifts)
    PSI   -> distribution comparison (gradual drift)
    Both  -> signal when scoring behavior changes
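ADWIN's adaptive windowing is involved enough to leave to River's `drift.ADWIN`, but the PSI side is a short formula: PSI = Σᵢ (pᵢ − qᵢ) ln(pᵢ/qᵢ) over score bins. A sketch, where the bin count and the conventional alert thresholds (≈0.1 moderate, ≈0.25 major) are standard choices rather than details from this system:

```python
import math

def psi(baseline, current, bins=10, eps=1e-6):
    """Population Stability Index between two samples of scores in [0, 1].
    eps smooths empty bins so the log never sees zero."""
    def hist(scores):
        counts = [0] * bins
        for s in scores:
            counts[min(int(s * bins), bins - 1)] += 1
        total = max(len(scores), 1)
        return [max(c / total, eps) for c in counts]

    p, q = hist(baseline), hist(current)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

# Identical distributions -> PSI near zero; a shifted one -> large PSI
base = [0.1, 0.2, 0.3, 0.4, 0.5] * 40
same = psi(base, base)
shifted = psi(base, [0.7, 0.8, 0.9] * 40)
```

Comparing the ring buffer's current scores against the stored baseline this way flags gradual drift that ADWIN's change-point test can miss.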
Adaptation State Machine: Self-Healing
When drift is detected, the system doesn't just alert — it adapts. A state machine governs the process: STABLE -> DRIFT_DETECTED (after 3 consecutive signals, to avoid false alarms) -> ADAPTING (resets ADWIN, PSI baseline, online metrics, model race) -> STABILIZING (waits for new scores to converge) -> RECOVERED -> STABLE. A 60-second cooldown prevents thrashing between states.
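The transitions just described can be sketched as a small state machine. State names follow the text; the class name, method names, and monotonic-clock cooldown bookkeeping are illustrative assumptions:

```python
import time

class AdaptationStateMachine:
    """Sketch of the drift-adaptation cycle: three consecutive drift signals
    trigger adaptation, and a 60 s cooldown after recovery prevents thrashing."""

    def __init__(self, required_signals=3, cooldown_s=60, clock=time.monotonic):
        self.state = "STABLE"
        self.required = required_signals
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.signals = 0          # consecutive drift signals seen
        self.cooldown_until = 0.0

    def on_drift_signal(self, drifted):
        if self.state == "STABLE":
            if self.clock() < self.cooldown_until:
                return self.state            # cooling down: ignore signals
            self.signals = self.signals + 1 if drifted else 0
            if self.signals >= self.required:
                self.state = "DRIFT_DETECTED"
        return self.state

    def adapt(self, reset_components):
        if self.state == "DRIFT_DETECTED":
            self.state = "ADAPTING"
            reset_components()   # reset ADWIN, PSI baseline, metrics, model race
            self.state = "STABILIZING"
        return self.state

    def on_scores_stabilized(self):
        if self.state == "STABILIZING":
            self.state = "RECOVERED"
            self.cooldown_until = self.clock() + self.cooldown_s
            self.signals = 0
            self.state = "STABLE"            # recovered; cooldown now active
        return self.state

sm = AdaptationStateMachine()
for _ in range(3):
    sm.on_drift_signal(True)                 # three consecutive signals
state_after_signals = sm.state               # now DRIFT_DETECTED
sm.adapt(lambda: None)
sm.on_scores_stabilized()
```

A single non-drift signal resets the consecutive counter, so isolated ADWIN blips never push the system out of STABLE.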
STABLE
-> DRIFT_DETECTED (3 consecutive drift signals)
-> ADAPTING (reset all online components)
-> STABILIZING (collect new baseline)
-> RECOVERED (scores stabilized)
-> STABLE (60s cooldown)

Why This Matters
Most ML systems are 'train once, deploy, pray.' This system closes the loop: models degrade, drift is detected, adaptation resets the online components, analysts provide ground truth, and the online model rebuilds itself. The entire cycle is automated and observable from the dashboard.