
Calibration Methodology

How the engine validates and maintains probability calibration across all weather variables — the foundation of the entire product stack.

Calibration is the foundation of everything Cliff Horizon sells. If the engine says 70% and the event happens 70% of the time, the warranty pricing works, the derivative pricing works, and the client trusts the analytics. If calibration drifts, everything downstream breaks.

ForecastEx Proving Ground

The engine validates its calibration on ForecastEx — CFTC-regulated binary temperature contracts traded via Interactive Brokers.

Why ForecastEx works as a proving ground:

  • Daily resolution — new contracts settle every day, providing continuous calibration data
  • Money at stake — not a backtest; real financial consequences for miscalibration
  • Market consensus — ForecastEx prices represent the crowd's probability estimate, providing an independent benchmark
  • Auditable — IBKR trade records provide verifiable proof of the engine's trading history

What success looks like:

Metric                     | Target                        | Meaning
Brier score (overall)      | < 0.10                        | Strong calibration
Max deviation (any bucket) | < ±5%                         | No systematic over/underconfidence
Reliability diagram        | Points within ±5% of diagonal | Visual proof of calibration
Track record length        | 90+ days                      | Sufficient sample size for statistical confidence
MACE                       | < 0.03                        | Mean Absolute Calibration Error

Calibration Results — All Variables

Temperature (Daily High) — Phase 1

Training: Jan–Feb 2026 (590 city-days). Test: March 2026 (310 city-days).

Metric           | Raw            | Calibrated   | Target
Brier score      | 0.0374         | 0.0349       | < 0.10
Isotonic samples | 12,980 (train) | 6,820 (test) | —

Bias correction (additive):

Metric    | Raw GFS | Corrected | Improvement
MAE (°F)  | 1.26    | 1.06      | -16%
Bias (°F) | -0.74   | 0.00      | Eliminated
RMSE (°F) | 1.68    | 1.48      | -12%

Per-city, per-month additive correction. Largest gains: Phoenix (1.42 → 0.61°F), Miami (1.38 → 0.98°F).

Edge calculator backtest:

  • 265 trades passed filters (confidence >= 2, z-score >= 1.0 — relaxed for proxy test)
  • 66.4% win rate
  • Positive sized P&L (+2.57)

Rainfall — Phase 4

Training: Jan–Feb 2026. Test: March 2026.

Metric             | Value
MAE                | 0.036 inches
Bias               | -0.0018 inches
Brier (raw)        | 0.0403
Brier (calibrated) | 0.0292
Bias method        | Multiplicative (ratio = mean(obs_wet) / mean(fct_wet))
Distribution       | Gamma (zero-inflated)

The low MAE reflects the fact that most days are dry (precip ≈ 0). The zero-inflated Gamma distribution handles the discrete-continuous nature of rainfall data. Multiplicative correction preserves the zero bound.
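The zero-inflated exceedance computation is short enough to sketch. The `gamma_exceedance(shape, scale, threshold, p_zero)` signature matches the one listed in the calibration pipeline, but the body here is an illustrative reconstruction (assuming scipy), not the engine's code:

```python
from scipy.stats import gamma

def gamma_exceedance(shape, scale, threshold, p_zero):
    """P(precip > threshold) under a zero-inflated Gamma: with
    probability p_zero the day is dry (precip = 0); otherwise
    precip follows Gamma(shape, scale)."""
    if threshold < 0:
        # Even the dry point mass at zero exceeds a negative threshold.
        return 1.0
    # The point mass at zero never exceeds a non-negative threshold,
    # so only the continuous (wet) component contributes.
    return (1.0 - p_zero) * gamma.sf(threshold, a=shape, scale=scale)

# Example: 70% of days dry; wet days average 0.2 in (shape=2, scale=0.1)
p = gamma_exceedance(shape=2.0, scale=0.1, threshold=0.25, p_zero=0.7)
```

At `threshold = 0` the function collapses to `1 - p_zero`, the probability of any measurable rain at all.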

Wind Speed — Phase 4

Training: Jan–Feb 2026. Test: March 2026.

Metric             | Value
MAE                | 7.16 mph
Bias               | +7.09 mph (GFS overforecasts station-level wind)
Brier (raw)        | 0.0514
Brier (calibrated) | 0.0462
Bias method        | Multiplicative (ratio ~1.61x)
Distribution       | Weibull

The large positive bias reflects the systematic mismatch between GFS 10m grid-average wind and IEM station-level sustained wind. This is a known scale discrepancy — GFS represents grid-cell averages while ASOS measures point observations. Multiplicative correction with a ~1.61x ratio addresses this effectively.

Irradiance — Phase 4

Training: Jan–Feb 2026. Test: March 2026.

Metric             | Value
MAE                | 1.49 MJ/m²
Bias               | -0.13 MJ/m²
Brier (raw)        | 0.0530
Brier (calibrated) | 0.0498
Bias method        | Multiplicative
Distribution       | Beta (on clear-sky index)
Ground truth       | ERA5 reanalysis (independent of forecast models)

Irradiance uses the clear-sky index (CSI = actual / clear-sky) as the probability variable, bounded on [0, 1]. The Beta distribution naturally handles this bounded domain.
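The bounded-domain exceedance is equally compact. The `beta_exceedance(alpha, beta, threshold)` signature matches the pipeline listing; the body (assuming scipy) is an illustrative reconstruction:

```python
from scipy.stats import beta as beta_dist

def beta_exceedance(alpha, b, threshold):
    """P(CSI > threshold) for a clear-sky index modelled as
    Beta(alpha, b) on [0, 1]. Thresholds outside the support
    clip to certainty or impossibility."""
    if threshold <= 0.0:
        return 1.0
    if threshold >= 1.0:
        return 0.0
    return beta_dist.sf(threshold, alpha, b)

# Example: moderately cloudy regime, mean CSI = alpha/(alpha+b) = 0.6
p = beta_exceedance(3.0, 2.0, 0.5)  # P(CSI > 0.5) = 0.6875
```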

Summary — All Variables Pass

Variable         | Brier (Calibrated) | Target | Status
Temperature (DH) | 0.0349             | < 0.10 | PASS
Rainfall         | 0.0292             | < 0.10 | PASS
Wind Speed       | 0.0462             | < 0.10 | PASS
Irradiance       | 0.0498             | < 0.10 | PASS

Calibration Pipeline — Technical Detail

Step 1: Bias Correction

Two methods, selected per variable:

Additive (temperature):

bias = mean(observed - forecast)   per (station, month)
corrected = raw + bias

Training: train_bias_parameters() in src/models/bias_correction.py. Stores bias and sigma per (station, month) with station-level and global fallbacks.
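A minimal sketch of this scheme, assuming a flat list of training pairs. `train_additive_bias`, `apply_additive`, and the tuple layout are illustrative stand-ins for the actual `train_bias_parameters()` API and its lookup logic:

```python
from collections import defaultdict
from statistics import mean

def train_additive_bias(pairs):
    """Per-(station, month) additive bias with station-level and
    global fallbacks. `pairs` is a list of
    (station, month, forecast, observed) tuples."""
    by_key, by_station, all_errs = defaultdict(list), defaultdict(list), []
    for station, month, fct, obs in pairs:
        err = obs - fct
        by_key[(station, month)].append(err)
        by_station[station].append(err)
        all_errs.append(err)
    return {
        "by_key": {k: mean(v) for k, v in by_key.items()},
        "by_station": {s: mean(v) for s, v in by_station.items()},
        "global": mean(all_errs),
    }

def apply_additive(params, station, month, raw):
    """corrected = raw + bias, falling back (station, month) ->
    station -> global when training data is missing."""
    bias = params["by_key"].get(
        (station, month),
        params["by_station"].get(station, params["global"]))
    return raw + bias
```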

Multiplicative (rainfall, wind, irradiance):

ratio = mean(observed_wet) / mean(forecast_wet)   per (station, month)
corrected = raw * ratio,   clamped >= 0

Training: train_multiplicative_bias() in src/models/bias_correction.py. Also stores p_zero_obs and p_zero_fct for zero-inflated distributions.
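The multiplicative scheme can be sketched similarly. This assumes "wet" means strictly positive on each side and collapses the per-(station, month) keying to a single pool for brevity; `train_multiplicative_bias` here is an illustrative stand-in for the real function of the same name:

```python
def train_multiplicative_bias(pairs):
    """Multiplicative bias for zero-bounded variables. The ratio is
    computed over wet cases only, and zero frequencies are stored
    for zero-inflated distributions. `pairs` is a list of
    (forecast, observed) tuples."""
    obs_wet = [o for _, o in pairs if o > 0.0]
    fct_wet = [f for f, _ in pairs if f > 0.0]
    ratio = (sum(obs_wet) / len(obs_wet)) / (sum(fct_wet) / len(fct_wet))
    n = len(pairs)
    return {
        "ratio": ratio,
        "p_zero_obs": sum(1 for _, o in pairs if o <= 0.0) / n,
        "p_zero_fct": sum(1 for f, _ in pairs if f <= 0.0) / n,
    }

def apply_multiplicative(params, raw):
    """corrected = raw * ratio, clamped >= 0."""
    return max(0.0, raw * params["ratio"])
```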

Step 2: Probability Computation

Live (ensemble available): Member counting for all variables.

P(event) = count(members exceeding) / total_members

Backtest (deterministic only): Distribution-specific CDF:

Variable    | CDF                 | Function
Temperature | Gaussian            | gaussian_exceedance(mean, sigma, threshold)
Rainfall    | Zero-inflated Gamma | gamma_exceedance(shape, scale, threshold, p_zero)
Wind        | Weibull             | weibull_exceedance(shape, scale, threshold)
Irradiance  | Beta                | beta_exceedance(alpha, beta, threshold)

All implemented in src/core/distribution.py.
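Both paths fit in a few lines. The signatures follow the table above, but the bodies are illustrative reconstructions, not the code in src/core/distribution.py:

```python
import math

def gaussian_exceedance(mean, sigma, threshold):
    """Backtest path: P(X > threshold) for X ~ Normal(mean, sigma),
    via the complementary error function (no dependencies)."""
    z = (threshold - mean) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def member_counting(members, threshold):
    """Live path: P(event) = count(members exceeding) / total_members."""
    return sum(1 for m in members if m > threshold) / len(members)

# A threshold at the forecast mean is a coin flip
p_backtest = gaussian_exceedance(70.0, 2.0, 70.0)       # 0.5
# 3 of 4 ensemble members above 70°F
p_live = member_counting([68.0, 71.0, 73.0, 75.0], 70.0)  # 0.75
```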

Step 3: Isotonic Regression

Maps raw probabilities to calibrated probabilities via a monotone increasing step function:

from sklearn.isotonic import IsotonicRegression

model = IsotonicRegression(y_min=0.001, y_max=0.999, out_of_bounds="clip")
model.fit(raw_probability, observed_exceeded)   # raw probs vs 0/1 outcomes
calibrated = model.predict(new_raw_probability)

Trained separately per variable. Implementation: train_isotonic_calibration() and apply_calibration() in src/models/calibration.py.

Step 4: Verification

brier_score = calculate_brier_score(probabilities, outcomes)
reliability_data = calculate_reliability_diagram_data(probabilities, outcomes, n_bins=10)

Both in src/models/calibration.py. Reliability diagram data provides bin-wise predicted-vs-observed frequencies for visual calibration proof.
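Both verification calls can be sketched in plain Python; the real implementations live in src/models/calibration.py, so these bodies are illustrative:

```python
def calculate_brier_score(probabilities, outcomes):
    """Mean squared error between forecast probabilities and 0/1
    outcomes. Lower is better; 0 is a perfect forecast."""
    n = len(probabilities)
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / n

def reliability_bins(probabilities, outcomes, n_bins=10):
    """Bin-wise (mean predicted probability, observed frequency)
    pairs: the data behind a reliability diagram."""
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probabilities, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p = 1.0 goes in the top bin
        bins[idx].append((p, o))
    return [(sum(p for p, _ in b) / len(b),
             sum(o for _, o in b) / len(b))
            for b in bins if b]
```

A perfectly calibrated forecast yields bin pairs on the diagonal: mean predicted probability equals observed frequency in every bucket.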

Calibration Monitoring

The engine monitors calibration in real time via the Calibration tab:

Reliability Diagram

Predicted probability vs observed frequency, with confidence bands. Updated daily as new settlement data arrives. Points should fall within +/-5% of the diagonal.

MACE (Mean Absolute Calibration Error)

The average absolute deviation between predicted probability buckets and observed frequencies. Target: MACE < 0.03. Computed from the reliability diagram bins.
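Given the reliability-diagram bins, MACE is one line; the `bin_stats` layout here is illustrative:

```python
def mace(bin_stats):
    """Mean Absolute Calibration Error over reliability-diagram bins.
    `bin_stats` is a list of (mean_predicted, observed_frequency)
    pairs, one per non-empty probability bucket."""
    return sum(abs(p - f) for p, f in bin_stats) / len(bin_stats)

# Example: three buckets, each off by a few points
score = mace([(0.10, 0.12), (0.50, 0.47), (0.90, 0.91)])
# (0.02 + 0.03 + 0.01) / 3 = 0.02, under the 0.03 target
```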

Per-Variable Tracking

Separate calibration monitoring for each registered variable — because calibration quality can differ significantly across variables. Rainfall (Brier 0.0292) is currently better calibrated than wind (Brier 0.0462).

Lead-Time Decay

Calibration quality degrades with lead time. The engine tracks calibration separately for D+0, D+1, D+2–3, and D+4–7 — and adjusts the z-score denominator accordingly via lead-time sigma scaling:

Lead Time | Sigma Multiplier
D+0       | 0.7x (observations available)
D+1       | 1.0x (baseline)
D+2       | 1.3x
D+3       | 1.6x
D+4+      | 2.0x
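The scaling can be sketched as follows; the multiplier table mirrors the one above, while `z_score` and its edge argument are illustrative assumptions about how the denominator is used:

```python
# Lead-time sigma multipliers; D+4 and beyond share the 2.0x factor
SIGMA_MULTIPLIER = {0: 0.7, 1: 1.0, 2: 1.3, 3: 1.6}

def lead_time_sigma(base_sigma, lead_days):
    """Widen the z-score denominator as lead time grows."""
    return base_sigma * SIGMA_MULTIPLIER.get(lead_days, 2.0)

def z_score(edge, base_sigma, lead_days):
    """Edge (engine probability minus market price) in sigma units.
    The same edge is worth more sigma at short lead times."""
    return edge / lead_time_sigma(base_sigma, lead_days)
```

For example, a 14-point edge at D+0 scores 0.14 / (0.1 x 0.7) = 2.0 sigma with a base sigma of 0.1, but only 1.0 sigma at D+2.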

Calibration Drift Detection

If calibration metrics deteriorate beyond defined thresholds:

  1. Alert — the Performance tab flags the drift (MACE > 0.03 or Brier exceeds target)
  2. Diagnosis — was it a model change (NWP update), a seasonal shift, or a data quality issue?
  3. Recalibration — isotonic regression curves are re-estimated using the most recent verification data. The train_isotonic_calibration() function is re-run with updated probability-outcome pairs.
  4. Ensemble reweighting — if one NWP source has degraded, its weight is reduced in the multi-model ensemble
  5. Divergence monitoring — the DivergenceMonitor flags >10pp probability shifts between morning and afternoon, catching sudden forecast drift

This self-correction loop is critical for maintaining warranty pricing accuracy. A miscalibrated engine that continues pricing warranties based on stale calibration curves would generate systematic losses for Ensuro's capital pool.
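The divergence check in step 5 reduces to a threshold on probability deltas. The dict-based interface below is an assumption for illustration, not the DivergenceMonitor's actual API:

```python
def divergence_flags(morning, afternoon, threshold_pp=10.0):
    """Flag contracts whose probability shifted more than
    `threshold_pp` percentage points between the morning and
    afternoon runs. Inputs map contract id -> probability in [0, 1]."""
    flags = []
    for cid, p_am in morning.items():
        p_pm = afternoon.get(cid)
        if p_pm is not None and abs(p_pm - p_am) * 100.0 > threshold_pp:
            flags.append(cid)
    return flags
```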

Calibration Artifacts

Each variable's calibration artifacts are stored in data/backtest/{variable}/:

File                     | Contents
forecast_vs_observed.csv | Raw paired forecast + observation data
bias_params.json         | Trained bias parameters (additive or multiplicative)
calibration_model.pkl    | Fitted IsotonicRegression model (pickled)
phase1_summary.json      | Summary metrics: Brier scores, train/test sizes, bias method, distribution type

The prob_sigma used in z-score computation is derived from the Phase 1 summary: prob_sigma = sqrt(brier_calibrated).
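That derivation is a one-liner; the Brier values below come from the summary table above, and the dict layout is illustrative:

```python
import math

# prob_sigma = sqrt(brier_calibrated), per variable
BRIER_CALIBRATED = {
    "temperature": 0.0349,
    "rainfall": 0.0292,
    "wind": 0.0462,
    "irradiance": 0.0498,
}
PROB_SIGMA = {v: math.sqrt(b) for v, b in BRIER_CALIBRATED.items()}
# e.g. temperature: sqrt(0.0349) ≈ 0.187
```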

From Calibration to Products

Calibration Output                       | Product Use
Reliability diagram                      | Sales proof: "here's why you can trust our probabilities"
Brier score                              | Ensuro partnership credential: quantified calibration quality
Per-bucket accuracy                      | Warranty threshold setting: defines what deviation is normal vs claimable
Lead-time decay curves                   | Alert timing: when to trigger early warnings vs when to wait for better data
ForecastEx track record                  | Client pitch: "we've proven this with real money"
Risk scores (rainfall, wind, irradiance) | Commercial products: construction delay, solar shortfall, wind operations
Variable-specific Brier                  | Per-product warranty confidence: rainfall (0.029) vs wind (0.046) quality levels