
Calibration Methodology

How the engine validates and maintains probability calibration across all weather variables — the foundation of the entire product stack.

Calibration is the foundation of everything Cliff Horizon sells. If the engine says 70% and the event happens 70% of the time, the warranty pricing works, the derivative pricing works, and the client trusts the analytics. If calibration drifts, everything downstream breaks.

ForecastEx Proving Ground

The engine validates its calibration on ForecastEx — CFTC-regulated binary temperature contracts traded via Interactive Brokers.

Why ForecastEx works as a proving ground:

  • Daily resolution — new contracts settle every day, providing continuous calibration data
  • Money at stake — not a backtest; real financial consequences for miscalibration
  • Market consensus — ForecastEx prices represent the crowd's probability estimate, providing an independent benchmark
  • Auditable — IBKR trade records provide verifiable proof of the engine's trading history

What success looks like:

Metric                     | Target                        | Meaning
Brier score (overall)      | < 0.10                        | Strong calibration
Max deviation (any bucket) | < ±5%                         | No systematic over/underconfidence
Reliability diagram        | Points within ±5% of diagonal | Visual proof of calibration
Track record length        | 90+ days                      | Sufficient sample size for statistical confidence
MACE                       | < 0.03                        | Mean Absolute Calibration Error

Calibration Results — All Variables

Temperature (Daily High) — Phase 1

Training: Jan–Feb 2026 (590 city-days). Test: March 2026 (310 city-days).

Metric           | Raw            | Calibrated   | Target
Brier score      | 0.0374         | 0.0349       | < 0.10
Isotonic samples | 12,980 (train) | 6,820 (test) | —

Bias correction (additive):

Metric    | Raw GFS | Corrected | Improvement
MAE (°F)  | 1.26    | 1.06      | -16%
Bias (°F) | -0.74   | 0.00      | Eliminated
RMSE (°F) | 1.68    | 1.48      | -12%

Per-city, per-month additive correction. Largest gains: Phoenix (1.42 → 0.61°F), Miami (1.38 → 0.98°F).

Edge calculator backtest:

  • 265 trades passed filters (confidence >= 2, z-score >= 1.0 — relaxed for proxy test)
  • 66.4% win rate
  • Positive sized P&L (+2.57)

Rainfall — Phase 4

Training: Jan–Feb 2026. Test: March 2026.

Metric             | Value
MAE                | 0.036 inches
Bias               | -0.0018 inches
Brier (raw)        | 0.0403
Brier (calibrated) | 0.0292
Bias method        | Multiplicative (ratio = mean(obs_wet) / mean(fct_wet))
Distribution       | Gamma (zero-inflated)

The low MAE reflects the fact that most days are dry (precip ≈ 0). The zero-inflated Gamma distribution handles the discrete-continuous nature of rainfall data. Multiplicative correction preserves the zero bound.
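The zero-inflated exceedance computation is short enough to sketch. The `gamma_exceedance(shape, scale, threshold, p_zero)` signature matches the one listed in the calibration pipeline, but the body here is an illustrative reconstruction (assuming scipy), not the engine's code:

```python
from scipy.stats import gamma

def gamma_exceedance(shape, scale, threshold, p_zero):
    """P(precip > threshold) under a zero-inflated Gamma: with
    probability p_zero the day is dry (precip = 0); otherwise
    precip follows Gamma(shape, scale)."""
    if threshold < 0:
        # Even the dry point mass at zero exceeds a negative threshold.
        return 1.0
    # The point mass at zero never exceeds a non-negative threshold,
    # so only the continuous (wet) component contributes.
    return (1.0 - p_zero) * gamma.sf(threshold, a=shape, scale=scale)

# Example: 70% of days dry; wet days average 0.2 in (shape=2, scale=0.1)
p = gamma_exceedance(shape=2.0, scale=0.1, threshold=0.25, p_zero=0.7)
```

At `threshold = 0` the function collapses to `1 - p_zero`, the probability of any measurable rain at all.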

Wind Speed — Phase 4

Training: Jan–Feb 2026. Test: March 2026.

Metric             | Value
MAE                | 7.16 mph
Bias               | +7.09 mph (GFS overforecasts station-level wind)
Brier (raw)        | 0.0514
Brier (calibrated) | 0.0462
Bias method        | Multiplicative (ratio ~1.61x)
Distribution       | Weibull

The large positive bias reflects the systematic mismatch between GFS 10m grid-average wind and IEM station-level sustained wind. This is a known scale discrepancy — GFS represents grid-cell averages while ASOS measures point observations. Multiplicative correction with a ~1.61x ratio addresses this effectively.

Irradiance — Phase 4

Training: Jan–Feb 2026. Test: March 2026.

Metric             | Value
MAE                | 1.49 MJ/m²
Bias               | -0.13 MJ/m²
Brier (raw)        | 0.0530
Brier (calibrated) | 0.0498
Bias method        | Multiplicative
Distribution       | Beta (on clear-sky index)
Ground truth       | ERA5 reanalysis (independent of forecast models)

Irradiance uses the clear-sky index (CSI = actual / clear-sky) as the probability variable, bounded on [0, 1]. The Beta distribution naturally handles this bounded domain.
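The bounded-domain exceedance is equally compact. The `beta_exceedance(alpha, beta, threshold)` signature matches the pipeline listing; the body (assuming scipy) is an illustrative reconstruction:

```python
from scipy.stats import beta as beta_dist

def beta_exceedance(alpha, b, threshold):
    """P(CSI > threshold) for a clear-sky index modelled as
    Beta(alpha, b) on [0, 1]. Thresholds outside the support
    clip to certainty or impossibility."""
    if threshold <= 0.0:
        return 1.0
    if threshold >= 1.0:
        return 0.0
    return beta_dist.sf(threshold, alpha, b)

# Example: moderately cloudy regime, mean CSI = alpha/(alpha+b) = 0.6
p = beta_exceedance(3.0, 2.0, 0.5)  # P(CSI > 0.5) = 0.6875
```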

Summary — All Variables Pass

Variable         | Brier (Calibrated) | Target | Status
Temperature (DH) | 0.0349             | < 0.10 | PASS
Rainfall         | 0.0292             | < 0.10 | PASS
Wind Speed       | 0.0462             | < 0.10 | PASS
Irradiance       | 0.0498             | < 0.10 | PASS

Calibration Pipeline — Technical Detail

Step 1: Bias Correction

Two methods, selected per variable:

Additive (temperature):

bias = mean(observed - forecast)   per (station, month)
corrected = raw + bias

Training: train_bias_parameters() in src/models/bias_correction.py. Stores bias and sigma per (station, month) with station-level and global fallbacks.
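A minimal sketch of this scheme, assuming a flat list of training pairs. `train_additive_bias`, `apply_additive`, and the tuple layout are illustrative stand-ins for the actual `train_bias_parameters()` API and its lookup logic:

```python
from collections import defaultdict
from statistics import mean

def train_additive_bias(pairs):
    """Per-(station, month) additive bias with station-level and
    global fallbacks. `pairs` is a list of
    (station, month, forecast, observed) tuples."""
    by_key, by_station, all_errs = defaultdict(list), defaultdict(list), []
    for station, month, fct, obs in pairs:
        err = obs - fct
        by_key[(station, month)].append(err)
        by_station[station].append(err)
        all_errs.append(err)
    return {
        "by_key": {k: mean(v) for k, v in by_key.items()},
        "by_station": {s: mean(v) for s, v in by_station.items()},
        "global": mean(all_errs),
    }

def apply_additive(params, station, month, raw):
    """corrected = raw + bias, falling back (station, month) ->
    station -> global when training data is missing."""
    bias = params["by_key"].get(
        (station, month),
        params["by_station"].get(station, params["global"]))
    return raw + bias
```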

Multiplicative (rainfall, wind, irradiance):

ratio = mean(observed_wet) / mean(forecast_wet)   per (station, month)
corrected = raw * ratio,   clamped >= 0

Training: train_multiplicative_bias() in src/models/bias_correction.py. Also stores p_zero_obs and p_zero_fct for zero-inflated distributions.
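The multiplicative scheme can be sketched similarly. This assumes "wet" means strictly positive on each side and collapses the per-(station, month) keying to a single pool for brevity; `train_multiplicative_bias` here is an illustrative stand-in for the real function of the same name:

```python
def train_multiplicative_bias(pairs):
    """Multiplicative bias for zero-bounded variables. The ratio is
    computed over wet cases only, and zero frequencies are stored
    for zero-inflated distributions. `pairs` is a list of
    (forecast, observed) tuples."""
    obs_wet = [o for _, o in pairs if o > 0.0]
    fct_wet = [f for f, _ in pairs if f > 0.0]
    ratio = (sum(obs_wet) / len(obs_wet)) / (sum(fct_wet) / len(fct_wet))
    n = len(pairs)
    return {
        "ratio": ratio,
        "p_zero_obs": sum(1 for _, o in pairs if o <= 0.0) / n,
        "p_zero_fct": sum(1 for f, _ in pairs if f <= 0.0) / n,
    }

def apply_multiplicative(params, raw):
    """corrected = raw * ratio, clamped >= 0."""
    return max(0.0, raw * params["ratio"])
```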

Step 2: Probability Computation

Live (ensemble available): Member counting for all variables.

P(event) = count(members exceeding) / total_members

Backtest (deterministic only): Distribution-specific CDF:

Variable    | CDF                 | Function
Temperature | Gaussian            | gaussian_exceedance(mean, sigma, threshold)
Rainfall    | Zero-inflated Gamma | gamma_exceedance(shape, scale, threshold, p_zero)
Wind        | Weibull             | weibull_exceedance(shape, scale, threshold)
Irradiance  | Beta                | beta_exceedance(alpha, beta, threshold)

All implemented in src/core/distribution.py.
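Both paths fit in a few lines. The signatures follow the table above, but the bodies are illustrative reconstructions, not the code in src/core/distribution.py:

```python
import math

def gaussian_exceedance(mean, sigma, threshold):
    """Backtest path: P(X > threshold) for X ~ Normal(mean, sigma),
    via the complementary error function (no dependencies)."""
    z = (threshold - mean) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def member_counting(members, threshold):
    """Live path: P(event) = count(members exceeding) / total_members."""
    return sum(1 for m in members if m > threshold) / len(members)

# A threshold at the forecast mean is a coin flip
p_backtest = gaussian_exceedance(70.0, 2.0, 70.0)       # 0.5
# 3 of 4 ensemble members above 70°F
p_live = member_counting([68.0, 71.0, 73.0, 75.0], 70.0)  # 0.75
```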

Step 3: Isotonic Regression

Maps raw probabilities to calibrated probabilities via a monotone increasing step function:

from sklearn.isotonic import IsotonicRegression

model = IsotonicRegression(y_min=0.001, y_max=0.999, out_of_bounds="clip")
model.fit(raw_probability, observed_exceeded)   # raw probs vs 0/1 outcomes
calibrated = model.predict(new_raw_probability)

Trained separately per variable. Implementation: train_isotonic_calibration() and apply_calibration() in src/models/calibration.py.

Step 4: Verification

brier_score = calculate_brier_score(probabilities, outcomes)
reliability_data = calculate_reliability_diagram_data(probabilities, outcomes, n_bins=10)

Both in src/models/calibration.py. Reliability diagram data provides bin-wise predicted-vs-observed frequencies for visual calibration proof.
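Both verification calls can be sketched in plain Python; the real implementations live in src/models/calibration.py, so these bodies are illustrative:

```python
def calculate_brier_score(probabilities, outcomes):
    """Mean squared error between forecast probabilities and 0/1
    outcomes. Lower is better; 0 is a perfect forecast."""
    n = len(probabilities)
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / n

def reliability_bins(probabilities, outcomes, n_bins=10):
    """Bin-wise (mean predicted probability, observed frequency)
    pairs: the data behind a reliability diagram."""
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probabilities, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p = 1.0 goes in the top bin
        bins[idx].append((p, o))
    return [(sum(p for p, _ in b) / len(b),
             sum(o for _, o in b) / len(b))
            for b in bins if b]
```

A perfectly calibrated forecast yields bin pairs on the diagonal: mean predicted probability equals observed frequency in every bucket.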

Calibration Monitoring

The engine monitors calibration in real time via the Calibration tab:

Reliability Diagram

Predicted probability vs observed frequency, with confidence bands. Updated daily as new settlement data arrives. Points should fall within +/-5% of the diagonal.

MACE (Mean Absolute Calibration Error)

The average absolute deviation between predicted probability buckets and observed frequencies. Target: MACE < 0.03. Computed from the reliability diagram bins.
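Given the reliability-diagram bins, MACE is one line; the `bin_stats` layout here is illustrative:

```python
def mace(bin_stats):
    """Mean Absolute Calibration Error over reliability-diagram bins.
    `bin_stats` is a list of (mean_predicted, observed_frequency)
    pairs, one per non-empty probability bucket."""
    return sum(abs(p - f) for p, f in bin_stats) / len(bin_stats)

# Example: three buckets, each off by a few points
score = mace([(0.10, 0.12), (0.50, 0.47), (0.90, 0.91)])
# (0.02 + 0.03 + 0.01) / 3 = 0.02, under the 0.03 target
```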

Per-Variable Tracking

Separate calibration monitoring for each registered variable — because calibration quality can differ significantly across variables. Rainfall (Brier 0.0292) is currently better calibrated than wind (Brier 0.0462).

Lead-Time Decay

Calibration quality degrades with lead time. The engine tracks calibration separately for D+0, D+1, D+2–3, and D+4–7 — and adjusts the z-score denominator accordingly via lead-time sigma scaling:

Lead Time | Sigma Multiplier
D+0       | 0.7x (observations available)
D+1       | 1.0x (baseline)
D+2       | 1.3x
D+3       | 1.6x
D+4+      | 2.0x
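The scaling can be sketched as follows; the multiplier table mirrors the one above, while `z_score` and its edge argument are illustrative assumptions about how the denominator is used:

```python
# Lead-time sigma multipliers; D+4 and beyond share the 2.0x factor
SIGMA_MULTIPLIER = {0: 0.7, 1: 1.0, 2: 1.3, 3: 1.6}

def lead_time_sigma(base_sigma, lead_days):
    """Widen the z-score denominator as lead time grows."""
    return base_sigma * SIGMA_MULTIPLIER.get(lead_days, 2.0)

def z_score(edge, base_sigma, lead_days):
    """Edge (engine probability minus market price) in sigma units.
    The same edge is worth more sigma at short lead times."""
    return edge / lead_time_sigma(base_sigma, lead_days)
```

For example, a 14-point edge at D+0 scores 0.14 / (0.1 x 0.7) = 2.0 sigma with a base sigma of 0.1, but only 1.0 sigma at D+2.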

Calibration Drift Detection

If calibration metrics deteriorate beyond defined thresholds:

  1. Alert — the Performance tab flags the drift (MACE > 0.03 or Brier exceeds target)
  2. Diagnosis — was it a model change (NWP update), a seasonal shift, or a data quality issue?
  3. Recalibration — isotonic regression curves are re-estimated using the most recent verification data. The train_isotonic_calibration() function is re-run with updated probability-outcome pairs.
  4. Ensemble reweighting — if one NWP source has degraded, its weight is reduced in the multi-model ensemble
  5. Divergence monitoring — the DivergenceMonitor flags >10pp probability shifts between morning and afternoon, catching sudden forecast drift

This self-correction loop is critical for maintaining warranty pricing accuracy. A miscalibrated engine that continues pricing warranties based on stale calibration curves would generate systematic losses for Ensuro's capital pool.
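The divergence check in step 5 reduces to a threshold on probability deltas. The dict-based interface below is an assumption for illustration, not the DivergenceMonitor's actual API:

```python
def divergence_flags(morning, afternoon, threshold_pp=10.0):
    """Flag contracts whose probability shifted more than
    `threshold_pp` percentage points between the morning and
    afternoon runs. Inputs map contract id -> probability in [0, 1]."""
    flags = []
    for cid, p_am in morning.items():
        p_pm = afternoon.get(cid)
        if p_pm is not None and abs(p_pm - p_am) * 100.0 > threshold_pp:
            flags.append(cid)
    return flags
```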

Calibration Artifacts

Each variable's calibration artifacts are stored in data/backtest/{variable}/:

File                     | Contents
forecast_vs_observed.csv | Raw paired forecast + observation data
bias_params.json         | Trained bias parameters (additive or multiplicative)
calibration_model.pkl    | Fitted IsotonicRegression model (pickled)
phase1_summary.json      | Summary metrics: Brier scores, train/test sizes, bias method, distribution type

The prob_sigma used in z-score computation is derived from the Phase 1 summary: prob_sigma = sqrt(brier_calibrated).
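That derivation is a one-liner; the Brier values below come from the summary table above, and the dict layout is illustrative:

```python
import math

# prob_sigma = sqrt(brier_calibrated), per variable
BRIER_CALIBRATED = {
    "temperature": 0.0349,
    "rainfall": 0.0292,
    "wind": 0.0462,
    "irradiance": 0.0498,
}
PROB_SIGMA = {v: math.sqrt(b) for v, b in BRIER_CALIBRATED.items()}
# e.g. temperature: sqrt(0.0349) ≈ 0.187
```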

From Calibration to Products

Calibration Output                       | Product Use
Reliability diagram                      | Sales proof: "here's why you can trust our probabilities"
Brier score                              | Ensuro partnership credential: quantified calibration quality
Per-bucket accuracy                      | Warranty threshold setting: defines what deviation is normal vs claimable
Lead-time decay curves                   | Alert timing: when to trigger early warnings vs when to wait for better data
ForecastEx track record                  | Client pitch: "we've proven this with real money"
Risk scores (rainfall, wind, irradiance) | Commercial products: construction delay, solar shortfall, wind operations
Variable-specific Brier                  | Per-product warranty confidence: rainfall (0.029) vs wind (0.046) quality levels