STREAM // Illicit Deep Dive

The Problem: Bitcoin AML at a Regulated Exchange

A regulated exchange with a BitLicense and state money transmitter licenses must screen every transaction for illicit activity. False negatives mean regulatory fines (potentially millions). False positives mean customer friction and support costs. ML automates the triage, not the decision.

The Data: Elliptic Bitcoin Dataset

~200K Bitcoin transactions across 49 timesteps, spaced roughly two weeks apart. 166 features per transaction: 1 timestep indicator, 93 local features (inputs, outputs, fees, size), and 72 aggregated features (statistics about neighboring transactions in the payment graph). Labels: licit, illicit, unknown. Among labeled transactions, licit outnumbers illicit by roughly 10:1.
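
To make the feature layout concrete, here is a minimal sketch that builds one name per feature. The names local_1, agg_1, and so on are hypothetical labels chosen for illustration; only the counts (1 + 93 + 72) come from the description above.

```python
# Hypothetical column names; the dataset description gives only the counts.
N_LOCAL = 93
N_AGGREGATED = 72


def feature_columns():
    """Return one name per feature: timestep, then local, then aggregated."""
    local = [f"local_{i}" for i in range(1, N_LOCAL + 1)]
    aggregated = [f"agg_{i}" for i in range(1, N_AGGREGATED + 1)]
    return ["timestep"] + local + aggregated


columns = feature_columns()
print(len(columns))  # 166
```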

Why Temporal Split (Not Random)

A random train/test split leaks future information: the model trains on transactions that occur after some of the transactions it is tested on. In production, only past data is available at scoring time. A temporal split (train on timesteps 1-34, test on 35-49) mimics real deployment. This typically reduces apparent performance by 5-15% relative to a random split, but the estimate is honest.
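
A minimal sketch of the temporal split, assuming each transaction is a dict with a "timestep" key (a hypothetical row format; a DataFrame filter on the timestep column would work the same way). The cutoff of 34 is the one stated above.

```python
TRAIN_LAST_STEP = 34  # train on timesteps 1-34, test on 35-49


def temporal_split(rows, train_last_step=TRAIN_LAST_STEP):
    """Split rows so that every training row precedes every test row in time."""
    train = [r for r in rows if r["timestep"] <= train_last_step]
    test = [r for r in rows if r["timestep"] > train_last_step]
    return train, test


# Toy data: one transaction per timestep.
rows = [{"tx": i, "timestep": i} for i in range(1, 50)]
train, test = temporal_split(rows)
print(len(train), len(test))  # 34 15
```

The property worth checking in any real pipeline is the last one: the latest training timestep is strictly earlier than the earliest test timestep.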

Why PR-AUC Over ROC-AUC

With 10:1 class imbalance, ROC-AUC tends to look optimistic: its false positive rate is divided by the large licit count, so a model can raise many false alarms and still score well. PR-AUC is built from precision and recall, both defined on the rare positive (illicit) class, which is what compliance cares about.
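
The gap between the two metrics can be seen on a toy ranking. The sketch below computes both by hand to stay dependency-free; average precision is used as the PR-AUC estimate and the scores are assumed to have no ties. In practice scikit-learn's roc_auc_score and average_precision_score would be the usual choice. The labels and scores are invented for illustration.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of precision at each positive's rank (assumes untied scores)."""
    ranked = sorted(zip(scores, y_true), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# 2 illicit vs 20 licit (10:1). The illicit transactions rank 3rd and 6th.
labels = [0, 0, 1, 0, 0, 1] + [0] * 16
scores = list(range(22, 0, -1))  # strictly decreasing, so list order = rank

print(round(roc_auc(labels, scores), 3))            # 0.85
print(round(average_precision(labels, scores), 3))  # 0.333
```

The same ranking earns 0.85 on ROC-AUC but only about 0.33 on average precision, because four of the six top-ranked alerts are false alarms.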

XGBoost: The Production Workhorse

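Fast inference (~2ms), SHAP-explainable (regulatory requirement), handles tabular data well. scale_pos_weight upweights the illicit class to offset the imbalance; a common starting value is the ratio of negative to positive training examples. Early stopping on validation PR-AUC prevents overfitting.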
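
A sketch of how the training parameters might be assembled. The keys are real XGBoost parameter names ("aucpr" is its PR-AUC evaluation metric), but the max_depth and eta values are placeholders, not the settings used here, and the helper name xgb_params is hypothetical. The dict only builds the configuration; the training call is described in the comment.

```python
def xgb_params(train_labels):
    """Build an XGBoost parameter dict with imbalance-aware weighting."""
    n_pos = sum(1 for y in train_labels if y == 1)
    n_neg = sum(1 for y in train_labels if y == 0)
    if n_pos == 0:
        raise ValueError("training labels contain no illicit examples")
    return {
        "objective": "binary:logistic",
        "eval_metric": "aucpr",             # early stopping monitors PR-AUC
        "scale_pos_weight": n_neg / n_pos,  # negative-to-positive ratio
        "max_depth": 6,                     # placeholder value
        "eta": 0.1,                         # placeholder value
    }


# These would be passed to xgboost.train(params, dtrain,
#     evals=[(dval, "val")], early_stopping_rounds=...), where dval is a
# validation set drawn from timesteps later than the training data.
params = xgb_params([1] + [0] * 10)
print(params["scale_pos_weight"])  # 10.0
```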

GCN: When Graph Structure Matters

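XGBoost scores each transaction as an independent row; its only view of the graph is the 72 precomputed aggregated features. The GCN operates on the transaction graph itself (who paid whom), so information can propagate across several hops. This captures money laundering patterns like layering (rapid fund movement through many addresses) that per-transaction features miss. Tradeoff: slower inference (~50ms), harder to explain to regulators.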
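
To illustrate what "seeing the graph" means, the sketch below implements one weight-free propagation step in the style of a standard GCN layer: add self-loops, then average neighbor features with symmetric degree normalization. It is a simplification, not the model used here. It treats payment edges as undirected and omits the learned weight matrix and nonlinearity a real layer would apply. The three-node chain is invented for illustration.

```python
import math


def gcn_propagate(features, edges):
    """One propagation step: D^-1/2 (A + I) D^-1/2 X, without learned weights."""
    n = len(features)
    neighbors = [{i} for i in range(n)]  # self-loops
    for u, v in edges:                   # edges treated as undirected
        neighbors[u].add(v)
        neighbors[v].add(u)
    degree = [len(nb) for nb in neighbors]
    dim = len(features[0])
    out = []
    for i in range(n):
        row = [0.0] * dim
        for j in neighbors[i]:
            weight = 1.0 / math.sqrt(degree[i] * degree[j])
            for k in range(dim):
                row[k] += weight * features[j][k]
        out.append(row)
    return out


# Chain of payments 0 -> 1 -> 2; only transaction 0 carries a risk signal.
x = [[1.0], [0.0], [0.0]]
edges = [(0, 1), (1, 2)]

h1 = gcn_propagate(x, edges)   # signal reaches node 1, not yet node 2
h2 = gcn_propagate(h1, edges)  # after two steps, node 2 sees it too
print(h1[2][0], round(h2[2][0], 4))  # 0.0 0.1667
```

Each stacked layer extends the receptive field by one hop, which is how a multi-layer GCN can pick up chains of rapid fund movement.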

Cost-Sensitive Threshold Selection

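The default 0.5 threshold targets classification error (and only does so when the scores are calibrated probabilities); it does not target business cost. A missed illicit transaction (FN) costs ~$50K in regulatory risk. A false alarm (FP) costs ~$50 in analyst review time. At these costs one miss outweighs 1,000 false alarms, so the cost-minimizing threshold sits far below 0.5. We optimize the threshold for minimum total business cost, not accuracy.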
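
A minimal sketch of the threshold search, using the two cost figures above. The labels and scores are invented for illustration, and the search assumes correct decisions cost nothing. The break_even helper gives the theoretical threshold for perfectly calibrated probabilities, FP cost / (FP cost + FN cost); an empirical sweep on validation data, as in best_threshold, is the safer choice when calibration is uncertain.

```python
FN_COST = 50_000  # missed illicit transaction
FP_COST = 50      # analyst review of a false alarm


def total_cost(y_true, scores, threshold):
    """Business cost when every score >= threshold is flagged."""
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    return fn * FN_COST + fp * FP_COST


def best_threshold(y_true, scores):
    """Sweep the observed scores and return (threshold, cost) with lowest cost."""
    candidates = sorted(set(scores))
    best = min(candidates, key=lambda t: total_cost(y_true, scores, t))
    return best, total_cost(y_true, scores, best)


def break_even(fp_cost=FP_COST, fn_cost=FN_COST):
    """Cost-minimizing threshold if scores were calibrated probabilities."""
    return fp_cost / (fp_cost + fn_cost)


y_true = [0, 0, 0, 0, 1, 0, 1]
scores = [0.01, 0.02, 0.05, 0.10, 0.20, 0.30, 0.90]

print(total_cost(y_true, scores, 0.5))  # 50000
print(best_threshold(y_true, scores))   # (0.2, 50)
```

On this toy data the default threshold misses one illicit transaction, while the cost-optimal threshold accepts one false alarm to catch it.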

SHAP -> LLM Narrative Pipeline

SHAP provides per-feature attributions for each flagged transaction. The Claude API translates those into compliance-ready narratives, for example: 'This transaction was flagged due to aggregated neighbor volume 3.2 SD above mean, combined with unusual output count suggesting fund splitting.' This reduces analyst review time from ~5 min to ~1 min per alert.
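
A sketch of the first half of that pipeline: turning SHAP attributions into a prompt. The function name, feature names, and prompt wording are all hypothetical, and the sketch assumes positive SHAP values push the score toward illicit. The call to the LLM is deliberately left out; the resulting string would be sent as the user message.

```python
def build_narrative_prompt(tx_id, risk_score, attributions, top_k=3):
    """Format the top-k SHAP attributions (by magnitude) as an LLM prompt."""
    ranked = sorted(attributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    lines = []
    for name, value in ranked[:top_k]:
        direction = "raises" if value > 0 else "lowers"
        lines.append(f"- {name}: SHAP {value:+.2f} ({direction} illicit risk)")
    return (
        f"Transaction {tx_id} was scored {risk_score:.2f} for illicit risk.\n"
        "Top contributing features:\n"
        + "\n".join(lines)
        + "\nWrite a two-sentence explanation for a compliance analyst. "
        "Mention only the features listed above."
    )


# Hypothetical attributions for one flagged transaction.
shap_values = {
    "agg_neighbor_volume": 1.40,
    "output_count": 0.90,
    "fee_rate": -0.05,
    "tx_size": 0.01,
}
print(build_narrative_prompt("tx_001", 0.87, shap_values, top_k=2))
```

Restricting the prompt to the top few attributions, and instructing the model to mention only those, keeps the narrative tied to what the classifier actually used.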