Calibration Methodology
How the engine validates and maintains probability calibration across all weather variables — the foundation of the entire product stack.
Calibration is the foundation of everything Cliff Horizon sells. If the engine says 70% and the event happens 70% of the time, the warranty pricing works, the derivative pricing works, and the client trusts the analytics. If calibration drifts, everything downstream breaks.
ForecastEx Proving Ground
The engine validates its calibration on ForecastEx — CFTC-regulated binary temperature contracts traded via Interactive Brokers.
Why ForecastEx works as a proving ground:
- Daily resolution — new contracts settle every day, providing continuous calibration data
- Money at stake — not a backtest; real financial consequences for miscalibration
- Market consensus — ForecastEx prices represent the crowd's probability estimate, providing an independent benchmark
- Auditable — IBKR trade records provide verifiable proof of the engine's trading history
What success looks like:
| Metric | Target | Meaning |
|---|---|---|
| Brier score (overall) | < 0.10 | Strong calibration |
| Max deviation (any bucket) | Within +/-5% | No systematic over/underconfidence |
| Reliability diagram | Points within +/-5% of diagonal | Visual proof of calibration |
| Track record length | 90+ days | Sufficient sample size for statistical confidence |
| MACE | < 0.03 | Mean Absolute Calibration Error |
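The Brier score target above is the mean squared error between forecast probabilities and binary outcomes. A minimal sketch, assuming numpy (this is illustrative, not the engine's `calculate_brier_score` implementation):

```python
import numpy as np

def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and binary (0/1)
    outcomes. Lower is better; a constant 0.5 forecast scores 0.25."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((probs - outcomes) ** 2))

# A well-calibrated 70% forecast: the event occurs 7 times out of 10.
print(brier_score([0.7] * 10, [1] * 7 + [0] * 3))  # close to 0.21
```

Note that single-event Brier scores are naturally higher than the aggregate targets in the table; the < 0.10 target applies across the full verification set.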
Calibration Results — All Variables
Temperature (Daily High) — Phase 1
Training: Jan–Feb 2026 (590 city-days). Test: March 2026 (310 city-days).
| Metric | Raw | Calibrated | Target |
|---|---|---|---|
| Brier score | 0.0374 | 0.0349 | < 0.10 |
| Isotonic samples | 12,980 (train) | 6,820 (test) | — |
Bias correction (additive):
| Metric | Raw GFS | Corrected | Improvement |
|---|---|---|---|
| MAE (°F) | 1.26 | 1.06 | -16% |
| Bias (°F) | -0.74 | 0.00 | Eliminated |
| RMSE (°F) | 1.68 | 1.48 | -12% |
Per-city, per-month additive correction. Largest gains: Phoenix (1.42 → 0.61°F), Miami (1.38 → 0.98°F).
Edge calculator backtest:
- 265 trades passed filters (confidence >= 2, z-score >= 1.0 — relaxed for proxy test)
- 66.4% win rate
- Positive sized P&L (+2.57)
Rainfall — Phase 4
Training: Jan–Feb 2026. Test: March 2026.
| Metric | Value |
|---|---|
| MAE | 0.036 inches |
| Bias | -0.0018 inches |
| Brier (raw) | 0.0403 |
| Brier (calibrated) | 0.0292 |
| Bias method | Multiplicative (ratio = mean(obs_wet) / mean(fct_wet)) |
| Distribution | Gamma (zero-inflated) |
The low MAE reflects the fact that most days are dry (precip ≈ 0). The zero-inflated Gamma distribution handles the discrete-continuous nature of rainfall data. Multiplicative correction preserves the zero bound.
Wind Speed — Phase 4
Training: Jan–Feb 2026. Test: March 2026.
| Metric | Value |
|---|---|
| MAE | 7.16 mph |
| Bias | +7.09 mph (GFS overforecasts station-level wind) |
| Brier (raw) | 0.0514 |
| Brier (calibrated) | 0.0462 |
| Bias method | Multiplicative (ratio ~1.61x) |
| Distribution | Weibull |
The large positive bias reflects the systematic mismatch between GFS 10m grid-average wind and IEM station-level sustained wind. This is a known scale discrepancy — GFS represents grid-cell averages while ASOS measures point observations. Multiplicative correction with a ~1.61x ratio addresses this effectively.
Irradiance — Phase 4
Training: Jan–Feb 2026. Test: March 2026.
| Metric | Value |
|---|---|
| MAE | 1.49 MJ/m^2 |
| Bias | -0.13 MJ/m^2 |
| Brier (raw) | 0.0530 |
| Brier (calibrated) | 0.0498 |
| Bias method | Multiplicative |
| Distribution | Beta (on clear-sky index) |
| Ground truth | ERA5 reanalysis (independent of forecast models) |
Irradiance uses the clear-sky index (CSI = actual / clear-sky) as the probability variable, bounded on [0, 1]. The Beta distribution naturally handles this bounded domain.
Summary — All Variables Pass
| Variable | Brier (Calibrated) | Target | Status |
|---|---|---|---|
| Temperature (DH) | 0.0349 | < 0.10 | PASS |
| Rainfall | 0.0292 | < 0.10 | PASS |
| Wind Speed | 0.0462 | < 0.10 | PASS |
| Irradiance | 0.0498 | < 0.10 | PASS |
Calibration Pipeline — Technical Detail
Step 1: Bias Correction
Two methods, selected per variable:
Additive (temperature):
bias = mean(observed - forecast) per (station, month)
corrected = raw + bias
Training: train_bias_parameters() in src/models/bias_correction.py. Stores bias and sigma per (station, month) with station-level and global fallbacks.
Multiplicative (rainfall, wind, irradiance):
ratio = mean(observed_wet) / mean(forecast_wet) per (station, month)
corrected = raw * ratio, clamped >= 0
Training: train_multiplicative_bias() in src/models/bias_correction.py. Also stores p_zero_obs and p_zero_fct for zero-inflated distributions.
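Both correction methods can be sketched in a few lines. This is a standalone illustration, not the `train_bias_parameters()` / `train_multiplicative_bias()` implementations; in particular, the definition of a "wet" case (forecast > 0) is an assumption here:

```python
import numpy as np

def additive_bias(observed, forecast):
    """Additive correction (temperature): bias = mean(obs - fct)."""
    return float(np.mean(np.asarray(observed, dtype=float)
                         - np.asarray(forecast, dtype=float)))

def apply_additive(raw, bias):
    return raw + bias

def multiplicative_ratio(observed, forecast):
    """Multiplicative correction (rainfall, wind, irradiance):
    ratio of means over nonzero-forecast ("wet") cases."""
    obs = np.asarray(observed, dtype=float)
    fct = np.asarray(forecast, dtype=float)
    wet = fct > 0  # assumed wet-case definition
    return float(obs[wet].mean() / fct[wet].mean())

def apply_multiplicative(raw, ratio):
    # Clamping at zero preserves the physical lower bound.
    return max(0.0, raw * ratio)
```

In production these parameters would be fit per (station, month), with fallbacks as described above.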
Step 2: Probability Computation
Live (ensemble available): Member counting for all variables.
P(event) = count(members exceeding) / total_members
Backtest (deterministic only): Distribution-specific CDF:
| Variable | CDF | Function |
|---|---|---|
| Temperature | Gaussian | gaussian_exceedance(mean, sigma, threshold) |
| Rainfall | Zero-inflated Gamma | gamma_exceedance(shape, scale, threshold, p_zero) |
| Wind | Weibull | weibull_exceedance(shape, scale, threshold) |
| Irradiance | Beta | beta_exceedance(alpha, beta, threshold) |
All implemented in src/core/distribution.py.
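Both paths can be sketched with scipy. The function names below mirror the table, but these are illustrative stand-ins for the src/core/distribution.py implementations:

```python
import numpy as np
from scipy import stats

def member_count_exceedance(members, threshold):
    """Live path: fraction of ensemble members exceeding the threshold."""
    return float(np.mean(np.asarray(members, dtype=float) > threshold))

def gaussian_exceedance(mean, sigma, threshold):
    return float(stats.norm.sf(threshold, loc=mean, scale=sigma))

def gamma_exceedance(shape, scale, threshold, p_zero):
    """Zero-inflated Gamma: with probability p_zero the value is exactly 0,
    so only the continuous component can exceed a positive threshold."""
    return float((1.0 - p_zero) * stats.gamma.sf(threshold, a=shape, scale=scale))

def weibull_exceedance(shape, scale, threshold):
    return float(stats.weibull_min.sf(threshold, c=shape, scale=scale))

def beta_exceedance(alpha, beta, threshold):
    """Irradiance: threshold expressed as a clear-sky index in [0, 1]."""
    return float(stats.beta.sf(threshold, a=alpha, b=beta))
```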
Step 3: Isotonic Regression
Maps raw probabilities to calibrated probabilities via a monotone non-decreasing step function:
model = IsotonicRegression(y_min=0.001, y_max=0.999, out_of_bounds="clip")
model.fit(raw_probability, observed_exceeded)
calibrated = model.predict(new_raw_probability)
Trained separately per variable. Implementation: train_isotonic_calibration() and apply_calibration() in src/models/calibration.py.
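A self-contained version of the snippet above, assuming scikit-learn and fitted on synthetic overconfident forecasts (the data-generating process here is purely illustrative):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Synthetic overconfident forecaster: the true event frequency is pulled
# toward 0.5 relative to the stated probability.
raw = rng.uniform(0.0, 1.0, size=5000)
true_p = 0.5 + 0.6 * (raw - 0.5)
outcomes = (rng.uniform(size=5000) < true_p).astype(int)

model = IsotonicRegression(y_min=0.001, y_max=0.999, out_of_bounds="clip")
model.fit(raw, outcomes)

calibrated = model.predict([0.1, 0.5, 0.9])
print(calibrated)  # pulled toward 0.5, roughly tracking 0.26 / 0.50 / 0.74
```

Isotonic regression makes no parametric assumption about the miscalibration shape, which is why it handles both over- and underconfidence with the same fit.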
Step 4: Verification
brier_score = calculate_brier_score(probabilities, outcomes)
reliability_data = calculate_reliability_diagram_data(probabilities, outcomes, n_bins=10)
Both in src/models/calibration.py. Reliability diagram data provides bin-wise predicted-vs-observed frequencies for visual calibration proof.
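The bin-wise reliability data, and the MACE derived from it, can be sketched as follows. This is an illustrative stand-in for `calculate_reliability_diagram_data()`; averaging MACE unweighted over occupied bins (rather than count-weighted) is an assumption:

```python
import numpy as np

def reliability_bins(probs, outcomes, n_bins=10):
    """Bin forecasts by predicted probability; return per-bin
    (mean predicted probability, observed frequency, count)."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(probs, edges) - 1, 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            rows.append((probs[mask].mean(), outcomes[mask].mean(), int(mask.sum())))
    return rows

def mace(bins):
    """Mean Absolute Calibration Error over occupied bins (target < 0.03)."""
    return float(np.mean([abs(p - o) for p, o, _ in bins]))
```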
Calibration Monitoring
The engine monitors calibration in real time via the Calibration tab:
Reliability Diagram
Predicted probability vs observed frequency, with confidence bands. Updated daily as new settlement data arrives. Points should fall within +/-5% of the diagonal.
MACE (Mean Absolute Calibration Error)
The average absolute deviation between predicted probability buckets and observed frequencies. Target: MACE < 0.03. Computed from the reliability diagram bins.
Per-Variable Tracking
Separate calibration monitoring for each registered variable — because calibration quality can differ significantly across variables. Rainfall (Brier 0.0292) is currently better calibrated than wind (Brier 0.0462).
Lead-Time Decay
Calibration quality degrades with lead time. The engine tracks calibration separately for D+0, D+1, D+2–3, and D+4–7 — and adjusts the z-score denominator accordingly via lead-time sigma scaling:
| Lead Time | Sigma Multiplier |
|---|---|
| D+0 | 0.7x (observations available) |
| D+1 | 1.0x (baseline) |
| D+2 | 1.3x |
| D+3 | 1.6x |
| D+4+ | 2.0x |
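The multipliers above translate directly into a scaled z-score denominator. A minimal sketch; the z-score formula itself (forecast distance from threshold in scaled sigmas) is an assumed illustration of how the scaling is applied:

```python
# Lead-time sigma multipliers from the table above; D+4 and beyond use 2.0x.
SIGMA_MULT = {0: 0.7, 1: 1.0, 2: 1.3, 3: 1.6}

def scaled_sigma(sigma, lead_days):
    return sigma * SIGMA_MULT.get(lead_days, 2.0)

def z_score(forecast, threshold, sigma, lead_days):
    """Distance from threshold in lead-time-scaled sigmas (illustrative)."""
    return (forecast - threshold) / scaled_sigma(sigma, lead_days)

print(z_score(75.0, 72.0, 1.5, 1))  # 2.0 at the D+1 baseline
print(z_score(75.0, 72.0, 1.5, 4))  # 1.0 at D+4+ (sigma doubled)
```

The same forecast edge that clears a z-score filter at D+1 may fail it at D+4, which is exactly the intended conservatism at long lead times.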
Calibration Drift Detection
If calibration metrics deteriorate beyond defined thresholds:
- Alert — the Performance tab flags the drift (MACE > 0.03 or Brier exceeds target)
- Diagnosis — was it a model change (NWP update), a seasonal shift, or a data quality issue?
- Recalibration — isotonic regression curves are re-estimated using the most recent verification data. The train_isotonic_calibration() function is re-run with updated probability-outcome pairs.
- Ensemble reweighting — if one NWP source has degraded, its weight is reduced in the multi-model ensemble
- Divergence monitoring — the DivergenceMonitor flags >10pp probability shifts between morning and afternoon, catching sudden forecast drift
This self-correction loop is critical for maintaining warranty pricing accuracy. A miscalibrated engine that continues pricing warranties based on stale calibration curves would generate systematic losses for Ensuro's capital pool.
Calibration Artifacts
Each variable's calibration artifacts are stored in data/backtest/{variable}/:
| File | Contents |
|---|---|
| forecast_vs_observed.csv | Raw paired forecast + observation data |
| bias_params.json | Trained bias parameters (additive or multiplicative) |
| calibration_model.pkl | Fitted IsotonicRegression model (pickled) |
| phase1_summary.json | Summary metrics: Brier scores, train/test sizes, bias method, distribution type |
The prob_sigma used in z-score computation is derived from the Phase 1 summary: prob_sigma = sqrt(brier_calibrated).
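Using the calibrated Brier scores from the summary table, the derivation is one line per variable (a sketch of the stated formula, not the artifact-loading code):

```python
import math

# Calibrated Brier scores from the Phase summaries above.
BRIER_CALIBRATED = {
    "temperature": 0.0349,
    "rainfall": 0.0292,
    "wind": 0.0462,
    "irradiance": 0.0498,
}

# prob_sigma = sqrt(brier_calibrated), e.g. temperature ≈ 0.187 —
# roughly a 19-point probability standard error for the z-score denominator.
prob_sigma = {var: math.sqrt(b) for var, b in BRIER_CALIBRATED.items()}
```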
From Calibration to Products
| Calibration Output | Product Use |
|---|---|
| Reliability diagram | Sales proof — "here's why you can trust our probabilities" |
| Brier score | Ensuro partnership credential — quantified calibration quality |
| Per-bucket accuracy | Warranty threshold setting — defines what deviation is normal vs claimable |
| Lead-time decay curves | Alert timing — when to trigger early warnings vs when to wait for better data |
| ForecastEx track record | Client pitch — "we've proven this with real money" |
| Risk scores (rainfall, wind, irradiance) | Commercial products — construction delay, solar shortfall, wind operations |
| Variable-specific Brier | Per-product warranty confidence — rainfall (0.029) vs wind (0.046) quality levels |