Champion / Challenger

Test new scorecard versions against your production model before making a full switch. Champion/challenger testing lets you validate model performance with real traffic while maintaining a safe fallback.

What is Champion/Challenger Testing?

In credit risk modeling, the champion is the currently active production model — the one whose scores drive lending decisions. A challenger is a new model version you want to evaluate against the champion using real applicant data.

Rather than replacing your champion outright, you deploy the challenger alongside it. Both models score every incoming request, giving you a direct performance comparison without risking your production decisions.

Shadow Mode vs Live Split Mode

Calibr supports two testing modes. Choose based on your risk tolerance and regulatory requirements.

| Aspect | Shadow Mode | Live Split Mode |
| --- | --- | --- |
| Decision maker | Champion only | Split by traffic percentage |
| Challenger scores | Logged but not returned to caller | Returned as the primary score for assigned traffic |
| Risk level | Zero — no impact on decisions | Moderate — challenger scores affect real decisions |
| Use case | Initial validation, regulatory review | A/B testing after shadow validation |
| API response | Always returns champion score; challenger in `shadow_scores` | Returns whichever model was selected for that request |

Setting It Up

1. Deploy your first model (champion)

Deploy a scorecard from the Calibr desktop app. The first deployed model automatically becomes the champion.

2. Deploy a second model version

Build and deploy a new scorecard version. It will be marked as inactive until you configure it as a challenger.

3. Add as challenger with traffic allocation

In the web dashboard, navigate to your scorecard's deployment settings and add the new version as a challenger. Configure the mode and traffic percentage:

```bash
curl -X POST https://api.calibr.dev/api/v1/scorecards/sc_01HXYZ/challengers \
  -H "Authorization: Bearer sk_live_abc123" \
  -H "Content-Type: application/json" \
  -d '{
    "challenger_version": "v4",
    "mode": "shadow",
    "traffic_pct": 100
  }'
```

In shadow mode, set `traffic_pct` to 100 so every request is scored by both models. In live split mode, start with a small percentage (e.g., 10%).
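One common way a scoring gateway can implement a live split (an illustrative sketch, not necessarily how Calibr routes traffic internally — `assign_model` and the hashing scheme are assumptions) is to hash a stable request identifier into a bucket, so the same applicant is always routed to the same model:

```python
import hashlib

def assign_model(request_id: str, traffic_pct: int) -> str:
    """Route a request to the challenger for roughly traffic_pct
    percent of traffic, and to the champion otherwise.
    Hashing makes the assignment deterministic per request_id."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100  # 0..99
    return "challenger" if bucket < traffic_pct else "champion"

# The same request always lands in the same bucket, so retries
# are scored by the same model:
assert assign_model("app_123", 10) == assign_model("app_123", 10)
```

With `traffic_pct` set to 0 every request goes to the champion; at 100 every request goes to the challenger.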

4. Monitor performance

The web dashboard shows side-by-side metrics for champion and challenger: score distributions, grade distributions, approval rates at various cutoffs, and population stability index (PSI). Compare these over at least two weeks of production traffic.
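The population stability index mentioned above is a standard measure of drift between two binned score distributions. A minimal sketch of the calculation (a standalone helper, not part of the Calibr API):

```python
import math

def psi(expected: list, actual: list) -> float:
    """Population Stability Index between two binned score
    distributions (counts per bin, same bin edges for both)."""
    e_total, a_total = sum(expected), sum(actual)
    value = 0.0
    for e, a in zip(expected, actual):
        # Floor tiny proportions to avoid log(0) on empty bins.
        e_pct = max(e / e_total, 1e-6)
        a_pct = max(a / a_total, 1e-6)
        value += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return value
```

Identical distributions give a PSI of 0; common rules of thumb treat PSI below 0.1 as stable and above 0.25 as a significant shift.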

5. Promote or remove

If the challenger outperforms the champion, promote it to become the new champion. The previous champion is archived but not deleted — you can always roll back.

```bash
# Promote challenger to champion
curl -X POST https://api.calibr.dev/api/v1/scorecards/sc_01HXYZ/promote \
  -H "Authorization: Bearer sk_live_abc123" \
  -H "Content-Type: application/json" \
  -d '{ "version": "v4" }'

# Or remove the challenger
curl -X DELETE https://api.calibr.dev/api/v1/scorecards/sc_01HXYZ/challengers/v4 \
  -H "Authorization: Bearer sk_live_abc123"
```

How Parallel Scoring Works

Regardless of mode, all models score every request. The engine loads both the champion and challenger specs, runs the scoring algorithm for each, and then decides which score to return based on the mode and traffic allocation.

This means you always have a complete comparison dataset. Even in live split mode, the champion's score is computed and logged for every request assigned to the challenger, and vice versa.
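The score-both-then-select flow described above can be sketched as follows (illustrative pseudologic, not Calibr's actual engine — the callables, the `"live_split"` mode name, and the bucket parameter are assumptions for the sketch):

```python
def score_request(features, champion, challenger, mode, traffic_pct, bucket):
    """Score with both models, then pick the returned score.

    `champion` and `challenger` are scoring callables; `bucket` is
    this request's traffic bucket in 0..99."""
    champ_result = champion(features)    # always computed and logged
    chall_result = challenger(features)  # always computed and logged
    if mode == "shadow" or bucket >= traffic_pct:
        primary, shadow = champ_result, chall_result
    else:  # live split, request assigned to the challenger
        primary, shadow = chall_result, champ_result
    return {**primary, "shadow_scores": [shadow]}
```

Both results are computed unconditionally; the mode and bucket only decide which one becomes the primary score in the response.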

```json
// Shadow mode response
{
  "score": 687,          // Champion score (always the decision)
  "pd": 0.034,
  "risk_grade": "B",
  "model_version": "v3",
  "shadow_scores": [
    { "version": "v4", "score": 702, "pd": 0.028, "risk_grade": "B" }
  ]
}
```
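When analyzing logged shadow-mode responses offline, a consumer might aggregate the challenger-minus-champion score deltas like this (an illustrative helper that assumes the response shape shown above; `score_deltas` is not part of the Calibr API):

```python
def score_deltas(responses):
    """Challenger-minus-champion score delta for each logged
    shadow-mode response of the shape shown above."""
    deltas = []
    for r in responses:
        for shadow in r.get("shadow_scores", []):
            deltas.append(shadow["score"] - r["score"])
    return deltas

resp = {"score": 687, "model_version": "v3",
        "shadow_scores": [{"version": "v4", "score": 702}]}
# score_deltas([resp]) -> [15]
```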

Rollback

If a promoted challenger shows degraded performance in production, you can instantly roll back to the previous champion:

```bash
curl -X POST https://api.calibr.dev/api/v1/scorecards/sc_01HXYZ/rollback \
  -H "Authorization: Bearer sk_live_abc123"
```

Rollback restores the previous champion version and removes the current version from active scoring. Use rollback when you see:

  • A significant shift in score distribution (high PSI)
  • Unexpected approval/decline rate changes
  • A spike in missing value warnings indicating data drift
  • Regulatory or compliance concerns with the new model

Best Practices

  • Start with shadow mode. Always validate a new model in shadow mode before switching to live split. This gives you a risk-free baseline comparison.
  • Compare for at least 2 weeks. Short observation windows can be misleading due to seasonal effects, marketing campaigns, or application volume fluctuations.
  • Monitor grade migration. Pay attention to how applicants shift between risk grades — not just the overall score distribution.
  • Document the rationale. Regulatory requirements often mandate documentation of model changes. Keep a record of why you promoted or rejected a challenger.
  • One challenger at a time. While the system supports multiple challengers, comparing more than one simultaneously makes it harder to isolate which changes drive performance differences.
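The grade-migration idea from the practices above can be made concrete as a cross-tab of champion vs challenger grades built from logged parallel scores (an illustrative helper, not part of the Calibr API):

```python
from collections import Counter

def grade_migration(pairs):
    """Cross-tab of (champion_grade, challenger_grade) pairs.
    Off-diagonal cells are applicants whose risk grade would
    change under the challenger."""
    return Counter(pairs)

moves = grade_migration([("B", "B"), ("B", "A"), ("C", "B")])
# moves[("B", "A")] counts applicants upgraded from B to A.
```

Even when the overall score distribution looks stable, large off-diagonal counts can signal that individual applicants are being re-ranked.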