Methodology
Our goal is to provide a transparent, repeatable way to evaluate probabilistic forecasts. We use the Brier Score as our primary metric because it is a strictly proper scoring rule: given a fixed set of questions and fixed scoring rules, your best long-run strategy is to report your true probabilities. However, leaderboards can still be gamed through selection (which questions you forecast) and timing (when you forecast). That is why we publish checkpoint rules, coverage, and benchmark definitions below.
0. What we score (the scorecard rules)
A score is only meaningful if the rules are clear. Our scorecards disclose:
- Eligible set: which markets or questions are included (and which are excluded).
- Checkpoint rule: which forecast is scored (to avoid rewarding last-minute updates).
- Weighting: how forecasts are weighted (default: equal weight per scored market).
- Benchmarks: what baseline we compare against for skill scoring.
- Reporting: sample size (N), coverage, and calibration diagnostics.
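As an illustration, these rules can be pinned down in one small, explicit structure. The sketch below uses hypothetical field names, not our actual schema:

```python
# Sketch of a scorecard rule set; field names are hypothetical, not our schema.
from dataclasses import dataclass

@dataclass
class ScorecardRules:
    eligible_markets: list[str]          # market IDs included in scoring
    excluded_markets: list[str]          # explicit exclusions
    checkpoint_hours_before_settle: int  # e.g. 24 for a T-24h checkpoint
    weighting: str = "equal"             # default: equal weight per scored market
    benchmark: str = "base_rate"         # or "market_consensus"

rules = ScorecardRules(
    eligible_markets=["mkt-001", "mkt-002", "mkt-003"],
    excluded_markets=["mkt-004"],
    checkpoint_hours_before_settle=24,
)
```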
1. The Brier Score
The Brier Score measures how close your probabilities were to outcomes for binary events (YES/NO). It is the mean squared error of probability forecasts.
Brier Score = (1/N) * Σᵢ (pᵢ - oᵢ)²
- pᵢ (Forecast): your probability for YES on question i (e.g., 0.75).
- oᵢ (Outcome): 1 if YES happened, 0 if NO happened.
- Range: from 0.00 (perfect) to 1.00 (max error).
Lower is always better. Always forecasting 0.50 yields a Brier Score of 0.25. (This is a useful baseline for balanced, 50/50 style questions.)
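As a minimal sketch of the computation (the function name and inputs below are illustrative):

```python
# Brier Score: mean squared error of probability forecasts (lower is better).
def brier_score(forecasts: list[float], outcomes: list[int]) -> float:
    """forecasts: probabilities for YES; outcomes: 1 if YES resolved, else 0."""
    if len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must have the same length")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Three forecasts, two of which resolved YES.
print(brier_score([0.75, 0.60, 0.20], [1, 1, 0]))  # ≈ 0.0875
```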
2. Timing and checkpoints (which forecast gets scored)
If you score only the last update right before settlement, you mostly reward late information. To make comparisons fair, we use an evaluation checkpoint rule on scorecards that need it.
Default checkpoint example: score the last forecast made at or before T-24h (24 hours before settlement).
If you did not submit a forecast by the checkpoint, that market is treated as missing for you (it reduces coverage).
- Multiple updates: only the forecast that exists at the checkpoint is scored.
- No look-ahead: benchmarks and consensus snapshots must be taken at or before the checkpoint time.
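A sketch of how the checkpoint rule can be applied, assuming each market has a simple list of timestamped updates (the record layout is illustrative):

```python
from datetime import datetime, timedelta

# Checkpoint rule: score the last forecast made at or before T-24h.
def forecast_at_checkpoint(updates, settlement_time, hours_before=24):
    """updates: list of (timestamp, probability) pairs, in any order.
    Returns the probability in force at the checkpoint, or None (missing)."""
    checkpoint = settlement_time - timedelta(hours=hours_before)
    eligible = [(t, p) for t, p in updates if t <= checkpoint]  # no look-ahead
    if not eligible:
        return None  # no forecast by the checkpoint: treated as missing
    return max(eligible, key=lambda tp: tp[0])[1]  # latest update at or before checkpoint

settlement = datetime(2024, 6, 1, 12, 0)
updates = [
    (datetime(2024, 5, 29, 9, 0), 0.55),
    (datetime(2024, 5, 31, 20, 0), 0.80),  # after T-24h, so not scored
]
print(forecast_at_checkpoint(updates, settlement))  # 0.55
```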
3. Coverage and minimum activity
Strong scores on tiny samples are usually noise. Every scorecard shows:
- N (sample size): number of scored forecasts.
- Coverage: the share of eligible markets you actually forecast, not just the ones you chose to enter.
Coverage matters because selection bias can make almost anyone look great if they only forecast easy questions.
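Coverage itself is a simple ratio. A sketch, assuming markets are identified by plain string IDs:

```python
# Coverage: share of eligible markets with a scored forecast (IDs illustrative).
def coverage(scored_market_ids: set[str], eligible_market_ids: set[str]) -> float:
    if not eligible_market_ids:
        return 0.0
    return len(scored_market_ids & eligible_market_ids) / len(eligible_market_ids)

eligible = {"mkt-001", "mkt-002", "mkt-003", "mkt-004"}
scored = {"mkt-001", "mkt-003"}
print(coverage(scored, eligible))  # 0.5 (forecast 2 of 4 eligible markets)
```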
4. Brier Skill Score (BSS)
Raw Brier Score is hard to interpret in isolation. A score of 0.15 can be amazing or average depending on how easy the questions were. Brier Skill Score compares you to a defined baseline.
BSS = 1 - (BS / BS_ref)
where BS is your Brier Score and BS_ref is the baseline's Brier Score on the same set of scored markets.
- BSS > 0: better than the baseline.
- BSS = 0: same as the baseline.
- BSS < 0: worse than the baseline.
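In code, the skill score is a one-liner once both Brier Scores are in hand (a sketch):

```python
# Brier Skill Score: improvement relative to a baseline's Brier Score.
def brier_skill_score(bs: float, bs_ref: float) -> float:
    """bs: your Brier Score; bs_ref: the baseline's Brier Score on the same markets."""
    return 1.0 - bs / bs_ref

print(brier_skill_score(0.15, 0.25))  # ≈ 0.4, i.e. 40% better than the baseline
```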
5. Benchmarks (what baseline means here)
We use benchmarks that are explicit and reproducible:
Default benchmark: Base rate
The base rate is the historical frequency of YES in a relevant reference class. Base rates can be computed for almost any question with a sensible reference class and are hard to game, so they are the clean default for BSS.
Example: if similar events resolve YES 62% of the time, the base rate baseline is p = 0.62 for that group.
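To turn that baseline into BS_ref, score the base rate like any other forecaster: it submits 0.62 on every market in the group. A sketch with illustrative numbers:

```python
# Base rate baseline: forecast the group's historical YES frequency everywhere,
# then score it like any other forecaster (numbers illustrative).
def base_rate_brier(outcomes: list[int], base_rate: float) -> float:
    return sum((base_rate - o) ** 2 for o in outcomes) / len(outcomes)

outcomes = [1, 1, 0, 1, 0]              # what actually happened in the group
print(base_rate_brier(outcomes, 0.62))  # ≈ 0.2404; this becomes BS_ref in BSS
```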
Optional benchmark: Market consensus
Market prices can be a strong benchmark when liquidity is real. If we use market consensus, we define:
- Consensus price: typically the mid price (not the last trade) at the checkpoint time, or a volume-weighted average price (VWAP) over a window ending at the checkpoint.
- Liquidity filters: spread caps and minimum volume thresholds to avoid thin-market noise.
- Timestamp discipline: consensus is captured at or before the checkpoint (no look-ahead).
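A sketch of how such a consensus snapshot might be taken; the thresholds below are illustrative rather than our production settings, and the quotes are assumed to have been captured at or before the checkpoint:

```python
# Consensus snapshot at the checkpoint: mid price, gated by liquidity filters.
# Thresholds are illustrative; bid/ask/volume are assumed to be captured
# at or before the checkpoint time.
def consensus_price(best_bid, best_ask, volume, max_spread=0.05, min_volume=1000):
    """Returns the mid price if the market passes the filters, else None."""
    spread = best_ask - best_bid
    if spread > max_spread or volume < min_volume:
        return None  # too thin to serve as a benchmark
    return (best_bid + best_ask) / 2

print(consensus_price(0.61, 0.63, volume=5000))  # ≈ 0.62
print(consensus_price(0.40, 0.55, volume=5000))  # None (spread too wide)
```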
6. Calibration vs sharpness
A strong Brier score usually comes from two components:
Calibration (reliability)
If you forecast 70% ten times, the event should happen about 7 times. If it happens 10 times, you were underconfident. If it happens 5 times, you were overconfident.
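One way to check this at scale is a reliability table: bin the forecasts and compare each bin's average forecast to the observed YES frequency. A sketch with an illustrative binning scheme:

```python
# Reliability table: bin forecasts and compare each bin's average forecast to
# the observed YES frequency (ten equal-width bins, purely illustrative).
def calibration_table(forecasts, outcomes, n_bins=10):
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(forecasts, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p = 1.0 falls into the last bin
        bins[idx].append((p, o))
    rows = []
    for items in bins:
        if items:
            avg_p = sum(p for p, _ in items) / len(items)
            yes_freq = sum(o for _, o in items) / len(items)
            rows.append((round(avg_p, 2), round(yes_freq, 2), len(items)))
    return rows  # (average forecast, observed YES frequency, count) per non-empty bin

forecasts = [0.7] * 10
outcomes = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]  # 7 of 10 resolved YES
print(calibration_table(forecasts, outcomes))  # [(0.7, 0.7, 10)]: well calibrated
```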
Sharpness (resolution)
Useful forecasters separate likely from unlikely when they have signal. Always saying 50% is safe but uninformative. Sharpness means moving away from 0.50 when justified.
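As a crude, illustrative proxy (not a definitive measure), sharpness can be gauged by the average distance of forecasts from 0.50:

```python
# One crude sharpness proxy: average distance of forecasts from 0.50.
# Illustrative only; decisiveness is only valuable if it is also calibrated.
def sharpness(forecasts: list[float]) -> float:
    return sum(abs(p - 0.5) for p in forecasts) / len(forecasts)

print(sharpness([0.5, 0.5, 0.5]))       # 0.0: safe but uninformative
print(sharpness([0.9, 0.1, 0.8, 0.2]))  # ≈ 0.35: decisive (good only if calibrated)
```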
7. Reference
Glenn W. Brier, "Verification of Forecasts Expressed in Terms of Probability", Monthly Weather Review, 78(1), 1-3, 1950.
If you want to replicate our results, the key is to use the same eligible set, checkpoint rule, and benchmark definitions. If any of these change, scores are no longer comparable.