Log Loss vs Brier: Which One Punishes You More and Why
Two proper scoring rules, two different personalities
Both Brier score and log loss reward honest probabilities. Both are proper scoring rules.
The difference is how they punish mistakes, especially extreme mistakes.
Brier score in one line
Brier score is squared error:
BS = (p - o)^2
• p is your predicted probability of YES
• o is outcome (1 for YES, 0 for NO)
It punishes mistakes smoothly. The penalty grows with distance, but not explosively.
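As a minimal sketch (plain Python, with an illustrative forecast rather than anything from a real platform), the formula maps directly to code:

```python
# Brier score for a single binary forecast: squared distance between
# the predicted probability and the 0/1 outcome.
p, outcome = 0.7, 1            # hypothetical forecast: 70% YES, and YES happened
brier = (p - outcome) ** 2
print(brier)                   # ~0.09
```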
Log loss in one line
Log loss penalizes you based on how much probability you assigned to what actually happened.
For a YES outcome:
LL = -log(p)
For a NO outcome:
LL = -log(1 - p)
It punishes extreme wrong calls extremely hard.
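A matching sketch, using the natural log (the log base is a convention choice; some write-ups use base 2 instead):

```python
import math

# Log loss for a single binary forecast: negative log of the probability
# assigned to the outcome that actually happened.
p, outcome = 0.7, 1            # same hypothetical forecast as above
log_loss = -math.log(p) if outcome == 1 else -math.log(1 - p)
print(log_loss)                # ~0.36
```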
Worked examples: the same forecast, two scores
Case 1: you predict 0.90 and YES happens
• Brier: (0.90 - 1)^2 = 0.01
• Log loss: -log(0.90) ≈ 0.11 (natural log), which is small
Both reward you for being confidently right.
Case 2: you predict 0.90 and NO happens
• Brier: (0.90 - 0)^2 = 0.81
• Log loss: -log(0.10) ≈ 2.30, which is very large
This is the key difference. Log loss treats this as a disaster, because you assigned only 10% probability to what actually occurred, and the penalty gets steeper the closer that probability sits to zero.
Case 3: you predict 0.60 and NO happens
• Brier: (0.60 - 0)^2 = 0.36
• Log loss: -log(0.40) ≈ 0.92, which is moderate
Both penalize the mistake, but log loss is not as explosive here because you did not go extreme.
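The three cases can be checked in a few lines (helper functions are repeated here so the snippet stands alone; natural log assumed):

```python
import math

def brier_score(p, outcome):
    return (p - outcome) ** 2

def log_loss(p, outcome):
    return -math.log(p) if outcome == 1 else -math.log(1 - p)

# (predicted probability of YES, actual outcome)
for p, outcome in [(0.90, 1), (0.90, 0), (0.60, 0)]:
    print(f"p={p:.2f}, outcome={outcome}: "
          f"Brier={brier_score(p, outcome):.2f}, "
          f"log loss={log_loss(p, outcome):.2f}")

# p=0.90, outcome=1: Brier=0.01, log loss=0.11
# p=0.90, outcome=0: Brier=0.81, log loss=2.30
# p=0.60, outcome=0: Brier=0.36, log loss=0.92
```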
Intuition: why log loss is harsher
Log loss cares about the probability assigned to the realized outcome. If you claim an outcome is almost impossible and it happens, log loss is designed to crush you.
Brier score still penalizes you heavily, but it is bounded between 0 and 1 for binary events. Log loss is unbounded as p approaches 0 or 1.
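You can see the boundedness difference by scoring increasingly extreme NO forecasts against a YES outcome (a throwaway sketch, not platform code):

```python
import math

# A YES outcome scored against ever more extreme "this won't happen" forecasts.
for p in [0.1, 0.01, 0.001, 0.0001]:
    brier = (p - 1) ** 2          # approaches 1, never exceeds it
    log_loss = -math.log(p)       # grows without bound as p -> 0
    print(f"p={p}: Brier={brier:.4f}, log loss={log_loss:.2f}")

# p=0.1:    Brier=0.8100, log loss=2.30
# p=0.01:   Brier=0.9801, log loss=4.61
# p=0.001:  Brier=0.9980, log loss=6.91
# p=0.0001: Brier=0.9998, log loss=9.21
```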
When Brier score is a good fit
• you want a stable, interpretable metric for scorecards
• you want to pair it with Brier skill score vs a benchmark
• you care about decomposing performance into reliability and resolution
When log loss is a good fit
• you want to strongly discourage extreme probabilities unless you are truly sure
• your system is sensitive to rare tail events and you want to penalize missing them
• you want a metric aligned with probabilistic likelihood methods
Why many platforms use both
A practical pattern is:
• headline score: Brier score plus BSS
• risk control score: log loss as an additional diagnostic
This gives users a stable main metric and a clear warning when they go too extreme.
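A rough sketch of that scorecard pattern (the function name, the sample forecasts, and the constant 0.5 benchmark are all illustrative assumptions; platforms choose their own benchmark):

```python
import math

def scorecard(probs, outcomes, benchmark=0.5):
    """Headline Brier score and BSS vs a constant benchmark, plus mean log loss."""
    n = len(probs)
    brier = sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / n
    brier_ref = sum((benchmark - o) ** 2 for o in outcomes) / n
    bss = 1 - brier / brier_ref
    mean_ll = sum(-math.log(p if o == 1 else 1 - p)
                  for p, o in zip(probs, outcomes)) / n
    return brier, bss, mean_ll

# Four hypothetical forecasts and their outcomes.
print(scorecard([0.8, 0.7, 0.2, 0.4], [1, 1, 0, 0]))
# ≈ (0.08, 0.67, 0.33)
```

Which benchmark the BSS is computed against (base rate, market price, a constant) is itself a methodology choice worth documenting.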
Probability clipping: a common implementation detail
Because log loss explodes at 0 and 1, many implementations apply probability clipping, for example forcing probabilities into a range like 0.01 to 0.99.
Clipping is not cheating, but it is a methodology choice. If you clip, document it.
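A minimal sketch of that idea, using the same 0.01 to 0.99 bounds mentioned above (exact bounds vary by implementation):

```python
import math

def clipped_log_loss(p, outcome, low=0.01, high=0.99):
    # Clamp the forecast before scoring so log loss stays finite.
    p = min(max(p, low), high)
    return -math.log(p) if outcome == 1 else -math.log(1 - p)

print(clipped_log_loss(1.0, 0))   # scored as 0.99 -> -log(0.01) ~ 4.61, not infinity
```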
Common mistakes
Comparing raw numbers across metrics: a Brier score of 0.18 is not directly comparable to a log loss of 0.52. They are on different scales.
Ignoring sample size: with a small sample, both metrics are noisy, and log loss in particular can be dominated by a single extreme miss (see the sketch after this list).
Reading log loss without calibration context: pair it with calibration diagnostics.
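As a small illustration of the sample-size point (the numbers are made up): twenty forecasts where nineteen are confidently right and one is extremely wrong.

```python
import math

# 19 correct calls at p = 0.90 plus one extreme miss: p = 0.999 on a NO outcome.
losses = [-math.log(0.90)] * 19 + [-math.log(1 - 0.999)]

print(f"mean log loss: {sum(losses) / len(losses):.3f}")              # ~0.45
print(f"share from the single miss: {losses[-1] / sum(losses):.0%}")  # ~78%
```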
Takeaway
Brier score is a stable squared error metric. Log loss is harsher and punishes extreme wrong calls much more. Use Brier (and BSS) for scorecards and comparisons, and use log loss as a second metric when you want to discourage reckless certainty.
Related
• Log Loss
• Brier Score vs Brier Skill Score: When to Use Which
• Brier Score Decomposition: Reliability, Resolution, Uncertainty