The Cold Start Problem
The trained XGBoost model expects 166 features per transaction, but live mempool transactions expose only three: vsize, fee, and fee_rate. Zero-padding the missing 163 features produces meaningless scores, so the system starts with a heuristic scorer and learns from analyst feedback to build a production model on the features it actually has.
Cascade Scoring: Honest Degradation
The scoring service tries three models in order: (1) the River online model if it has enough labels, (2) a batch model retrained on accumulated feedback, (3) a heuristic based on fee rate, transaction size, and fee disproportion. Each level is honest about what it is — the UI badges transactions as 'ML' or 'HEURISTIC' so analysts know the confidence level.
score_transaction(vsize, fee, fee_rate):
    1. river_predict_one()      # online model (if labels exist)
    2. learned_model.predict()  # batch feedback model
    3. heuristic_risk_score()   # sigmoid-based fallback
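A minimal sketch of the cascade in Python. The function names, weights, and sigmoid midpoints below are illustrative assumptions, not the actual service code:

```python
import math

def heuristic_risk_score(vsize, fee, fee_rate):
    """Sigmoid-based fallback combining fee rate and fee disproportion.
    Midpoints (50 sat/vB, 100 sat/byte disproportion) are illustrative."""
    disproportion = fee / max(vsize, 1)          # fee large relative to size
    s1 = 1 / (1 + math.exp(-(fee_rate - 50) / 25))        # extreme fee rate
    s2 = 1 / (1 + math.exp(-(disproportion - 100) / 50))  # fee >> size
    return 0.5 * s1 + 0.5 * s2                   # blend into a [0, 1] score

def score_transaction(tx, river_model=None, batch_model=None,
                      labels_seen=0, min_labels=20):
    """Cascade: online model -> batch model -> heuristic, tagging the source
    so the UI can badge the result as 'ML' or 'HEURISTIC'."""
    x = {"vsize": tx["vsize"], "fee": tx["fee"], "fee_rate": tx["fee_rate"]}
    if river_model is not None and labels_seen >= min_labels:
        return river_model.predict_proba_one(x).get(True, 0.0), "ML"
    if batch_model is not None:
        return float(batch_model.predict([list(x.values())])[0]), "ML"
    return heuristic_risk_score(**x), "HEURISTIC"

# With no models yet trained, the heuristic answers
score, source = score_transaction({"vsize": 250, "fee": 50_000, "fee_rate": 200})
```

With neither learned model available, the cascade falls through to the heuristic and says so, which is the honest-degradation behavior described above.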
River: Incremental Learning from Analyst Feedback
River is an online ML library — models learn one sample at a time. When an analyst marks a flagged transaction as true positive or false positive, that label is immediately fed to the River model (StandardScaler | LogisticRegression). No batch retraining needed. The model improves with every review, creating a tight human-in-the-loop feedback cycle.
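In River itself the pipeline is `preprocessing.StandardScaler() | linear_model.LogisticRegression()`, updated via `learn_one` and queried via `predict_proba_one`. The standalone sketch below mimics that interface without the library, to show what one incremental update actually does; the learning rate and feature names are illustrative:

```python
import math

class OnlineScaledLogReg:
    """Sketch of what a River StandardScaler | LogisticRegression pipeline does
    per sample: update running mean/variance (Welford), standardize, then take
    one SGD step on logistic loss. Not the River implementation itself."""

    def __init__(self, lr=0.05):
        self.lr = lr
        self.n = 0
        self.mean = {}   # running per-feature mean
        self.m2 = {}     # running sum of squared deviations (Welford)
        self.w = {}      # weights
        self.b = 0.0     # intercept

    def _scale(self, x):
        out = {}
        for k, v in x.items():
            mu = self.mean.get(k, 0.0)
            var = self.m2.get(k, 0.0) / max(self.n, 1)
            out[k] = (v - mu) / math.sqrt(var) if var > 0 else 0.0
        return out

    def predict_proba_one(self, x):
        z = self.b + sum(self.w.get(k, 0.0) * v for k, v in self._scale(x).items())
        p = 1 / (1 + math.exp(-z))
        return {True: p, False: 1 - p}

    def learn_one(self, x, y):
        # Welford update of per-feature mean/variance
        self.n += 1
        for k, v in x.items():
            d = v - self.mean.get(k, 0.0)
            self.mean[k] = self.mean.get(k, 0.0) + d / self.n
            self.m2[k] = self.m2.get(k, 0.0) + d * (v - self.mean[k])
        # One SGD step on logistic loss
        err = self.predict_proba_one(x)[True] - (1.0 if y else 0.0)
        for k, v in self._scale(x).items():
            self.w[k] = self.w.get(k, 0.0) - self.lr * err * v
        self.b -= self.lr * err

model = OnlineScaledLogReg()
# Analyst feedback arrives one label at a time, no batch retraining
for _ in range(200):
    model.learn_one({"fee_rate": 300.0, "fee": 60_000.0, "vsize": 200.0}, True)
    model.learn_one({"fee_rate": 2.0, "fee": 300.0, "vsize": 150.0}, False)
```

Every `learn_one` call costs microseconds, which is why a label from an analyst review can update the model immediately.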
Model Racing: Let Them Compete
Three online classifiers race in parallel: Logistic Regression, Hoeffding Tree, and Gaussian Naive Bayes. Each receives every label and makes predictions. Rolling F1 scores track performance. The system uses F1-weighted ensemble voting — the best-performing model gets the most influence. This answers 'which algorithm works best for this data?' empirically, not theoretically.
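A sketch of the scoring side of the race, assuming rolling F1 over a fixed window and linear F1-weighted averaging of the models' probabilities (window size and the fallback-to-plain-average rule are assumptions):

```python
from collections import deque

class RollingF1:
    """Rolling F1 over the last `window` labeled predictions."""
    def __init__(self, window=500):
        self.pairs = deque(maxlen=window)   # (y_true, y_pred)

    def update(self, y_true, y_pred):
        self.pairs.append((y_true, y_pred))

    def get(self):
        tp = sum(1 for t, p in self.pairs if t and p)
        fp = sum(1 for t, p in self.pairs if not t and p)
        fn = sum(1 for t, p in self.pairs if t and not p)
        denom = 2 * tp + fp + fn
        return 2 * tp / denom if denom else 0.0

def ensemble_score(probas, f1s):
    """F1-weighted vote: each model's probability is weighted by its rolling F1.
    Falls back to a plain average when no model has earned any F1 yet."""
    total = sum(f1s)
    if total == 0:
        return sum(probas) / len(probas)
    return sum(p * w for p, w in zip(probas, f1s)) / total

# Current probabilities and rolling F1 scores for the three racers (illustrative)
probas = [0.9, 0.4, 0.7]   # logistic regression, Hoeffding tree, naive Bayes
f1s = [0.8, 0.2, 0.5]
score = ensemble_score(probas, f1s)
```

The best-performing model dominates the vote, but a weaker model that starts winning on new data gains influence automatically as its rolling F1 climbs.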
Drift Detection: ADWIN + PSI
ADWIN (Adaptive Windowing) detects change points in the prediction score stream — it dynamically grows and shrinks a window to identify when the scoring distribution shifts. PSI (Population Stability Index) compares the current score distribution against a baseline. Together, they catch both sudden regime changes and gradual drift.
Drift monitor (ring buffer of 1000 scores):
    ADWIN -> change-point detection (sudden shifts)
    PSI   -> distribution comparison (gradual drift)
    Both  -> signal when scoring behavior changes
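ADWIN's adaptive windowing is involved enough to leave to River's `drift.ADWIN`, but the PSI side is a short formula: PSI = Σᵢ (pᵢ − qᵢ) ln(pᵢ/qᵢ) over score bins. A sketch, where the bin count and the conventional alert thresholds (≈0.1 moderate, ≈0.25 major) are standard choices rather than details from this system:

```python
import math

def psi(baseline, current, bins=10, eps=1e-6):
    """Population Stability Index between two samples of scores in [0, 1].
    eps smooths empty bins so the log never sees zero."""
    def hist(scores):
        counts = [0] * bins
        for s in scores:
            counts[min(int(s * bins), bins - 1)] += 1
        total = max(len(scores), 1)
        return [max(c / total, eps) for c in counts]

    p, q = hist(baseline), hist(current)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

# Identical distributions -> PSI near zero; a shifted one -> large PSI
base = [0.1, 0.2, 0.3, 0.4, 0.5] * 40
same = psi(base, base)
shifted = psi(base, [0.7, 0.8, 0.9] * 40)
```

Comparing the ring buffer's current scores against the stored baseline this way flags gradual drift that ADWIN's change-point test can miss.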
Adaptation State Machine: Self-Healing
When drift is detected, the system doesn't just alert — it adapts. A state machine governs the process: STABLE -> DRIFT_DETECTED (after 3 consecutive signals, to avoid false alarms) -> ADAPTING (resets ADWIN, PSI baseline, online metrics, model race) -> STABILIZING (waits for new scores to converge) -> RECOVERED -> STABLE. A 60-second cooldown prevents thrashing between states.
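The transitions just described can be sketched as a small state machine. State names follow the text; the class name, method names, and monotonic-clock cooldown bookkeeping are illustrative assumptions:

```python
import time

class AdaptationStateMachine:
    """Sketch of the drift-adaptation cycle: three consecutive drift signals
    trigger adaptation, and a 60 s cooldown after recovery prevents thrashing."""

    def __init__(self, required_signals=3, cooldown_s=60, clock=time.monotonic):
        self.state = "STABLE"
        self.required = required_signals
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.signals = 0          # consecutive drift signals seen
        self.cooldown_until = 0.0

    def on_drift_signal(self, drifted):
        if self.state == "STABLE":
            if self.clock() < self.cooldown_until:
                return self.state            # cooling down: ignore signals
            self.signals = self.signals + 1 if drifted else 0
            if self.signals >= self.required:
                self.state = "DRIFT_DETECTED"
        return self.state

    def adapt(self, reset_components):
        if self.state == "DRIFT_DETECTED":
            self.state = "ADAPTING"
            reset_components()   # reset ADWIN, PSI baseline, metrics, model race
            self.state = "STABILIZING"
        return self.state

    def on_scores_stabilized(self):
        if self.state == "STABILIZING":
            self.state = "RECOVERED"
            self.cooldown_until = self.clock() + self.cooldown_s
            self.signals = 0
            self.state = "STABLE"            # recovered; cooldown now active
        return self.state

sm = AdaptationStateMachine()
for _ in range(3):
    sm.on_drift_signal(True)                 # three consecutive signals
state_after_signals = sm.state               # now DRIFT_DETECTED
sm.adapt(lambda: None)
sm.on_scores_stabilized()
```

A single non-drift signal resets the consecutive counter, so isolated ADWIN blips never push the system out of STABLE.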
STABLE
-> DRIFT_DETECTED (3 consecutive drift signals)
-> ADAPTING (reset all online components)
-> STABILIZING (collect new baseline)
-> RECOVERED (scores stabilized)
-> STABLE (60s cooldown)

Why This Matters
Most ML systems are 'train once, deploy, pray.' This system closes the loop: models degrade, drift is detected, adaptation resets the online components, analysts provide ground truth, and the online model rebuilds itself. The entire cycle is automated and observable from the dashboard.