Independent Holdout Validation

Validation Methodology: 84.3% Accuracy at 12 Months

The Stability Engine retention risk scoring system has been independently validated using N-1 temporal holdout methodology across a cohort of 51 candidates. This page explains the methodology, the results, and the honest limits of what the validation shows.

  • 84.3%: 12-month accuracy
  • 72.5%: score accuracy (±15 pts)
  • 0.169: mean Cox Brier score
  • n=51: holdout cohort

Validation methodology: N-1 temporal holdout

The validation uses N-1 temporal holdout — a methodology designed to test whether a scoring system generates useful predictions from only the data available at the pre-hire stage.

How it works:

  1. Withhold the most recent completed role

    For each candidate in the validation cohort, the most recent completed role (with a known start date, end date, and tenure length) is removed from the career file. This becomes the ground truth.

  2. Score on prior history only

    Stability Engine runs on the remaining career history, that is, what was visible before that last role started. This mirrors the actual pre-hire information state: the system scores only what would have been available at the moment of the offer.

  3. Predict 12-month retention

    The system generates a Stability Score and a 12-month retention probability for each candidate, based solely on prior career history.

  4. Compare prediction to ground truth

    The predicted 12-month retention outcome is compared to the actual tenure of the withheld role. A candidate who scored in the higher risk bands and departed within 12 months is a correct prediction. A candidate who scored in the lower risk bands and remained past 12 months is also a correct prediction.
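The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the Stability Engine implementation: the `Role` type, `split_n_minus_1`, `prediction_is_correct`, and especially `toy_predictor` are hypothetical names invented here, and the toy rule (mean prior tenure of at least 12 months) stands in for a scoring model whose internals this page does not describe.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, List, Tuple


@dataclass
class Role:
    """A completed role with known start and end dates (hypothetical type)."""
    start: date
    end: date

    @property
    def tenure_months(self) -> float:
        # Average month length; precise enough for a 12-month threshold.
        return (self.end - self.start).days / 30.4375


def split_n_minus_1(history: List[Role]) -> Tuple[List[Role], Role]:
    """Step 1: withhold the most recent completed role as ground truth."""
    ordered = sorted(history, key=lambda r: r.end)
    if len(ordered) < 2:
        raise ValueError("need at least two completed roles")
    return ordered[:-1], ordered[-1]


def toy_predictor(prior: List[Role]) -> bool:
    """Assumed stand-in for the real scorer: predict retention past 12 months
    when the mean prior tenure is at least 12 months."""
    mean_tenure = sum(r.tenure_months for r in prior) / len(prior)
    return mean_tenure >= 12.0


def prediction_is_correct(
    history: List[Role],
    predict_retained: Callable[[List[Role]], bool],
    threshold_months: float = 12.0,
) -> bool:
    """Steps 2 to 4: score on prior history only, then compare to the withheld role."""
    prior, withheld = split_n_minus_1(history)
    predicted = predict_retained(prior)  # sees prior roles only
    actual = withheld.tenure_months >= threshold_months
    return predicted == actual


history = [
    Role(date(2015, 1, 1), date(2017, 1, 1)),
    Role(date(2017, 2, 1), date(2019, 2, 1)),
    Role(date(2019, 3, 1), date(2019, 9, 1)),  # most recent role, about 6 months
]
prior, withheld = split_n_minus_1(history)
# The toy rule predicts retention, but the withheld role ended early: a miss.
correct = prediction_is_correct(history, toy_predictor)
```

The key property is that the predictor never receives the withheld role, so nothing about its outcome can leak into the score.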

Validation results

| Metric               | Result               | What it measures                                                      |
|----------------------|----------------------|-----------------------------------------------------------------------|
| 12-month accuracy    | 84.3%                | Correct binary classification at the 12-month retention threshold    |
| Score accuracy       | 72.5%                | Stability Scores within ±15 points of the reference label             |
| Mean Cox Brier score | 0.169                | Probabilistic calibration quality (0 = perfect, lower is better)      |
| Cohort size          | n=51                 | Total candidates in the holdout validation cohort                     |
| Methodology          | N-1 temporal holdout | Prior career history only; most recent role withheld as ground truth  |
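With a cohort of 51, each percentage in the table corresponds to a whole number of candidates. The sketch below recovers those counts and shows how the two rate metrics could be computed from per-candidate results. The function names are hypothetical, and the counts are inferred from the published percentages rather than taken from the study itself.

```python
from typing import Sequence


def implied_count(rate_percent: float, n: int) -> int:
    """Return the count k for which k/n rounds to rate_percent (one decimal)."""
    k = round(rate_percent / 100 * n)
    if round(100 * k / n, 1) != rate_percent:
        raise ValueError("no whole count reproduces this percentage")
    return k


def binary_accuracy(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """Share of candidates whose predicted 12-month outcome matched reality."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(predicted)


def within_tolerance_rate(
    scores: Sequence[float], reference: Sequence[float], tol: float = 15.0
) -> float:
    """Share of scores that fall within +/- tol points of the reference label."""
    if len(scores) != len(reference) or not scores:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(s - r) <= tol for s, r in zip(scores, reference)) / len(scores)


correct_at_12_months = implied_count(84.3, 51)  # 43 of 51
scores_within_15 = implied_count(72.5, 51)      # 37 of 51
```

Read this way, 84.3% means 43 of 51 candidates classified correctly, and 72.5% means 37 of 51 scores within ±15 points.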

What the Stability Score measures

The Stability Score analyzes structural career history signals — not interview performance, personality assessments, or self-reported preferences. The relevant signals include:

  • Prior tenure patterns: how long the candidate stayed across completed roles, and the distribution of that tenure
  • Transition density: how quickly the candidate has moved between roles and environments
  • History alignment: whether the prior career pattern matches the stability demands of the role being assessed
  • Environmental fit signals: whether prior operating environments resemble the current one
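To make the first two signals concrete, here is one way tenure patterns and transition density could be quantified. These definitions are assumptions made for illustration; the page does not specify how Stability Engine computes its signals, and the function names are hypothetical.

```python
from statistics import mean, median, pstdev
from typing import Dict, Sequence


def tenure_summary(tenures_months: Sequence[float]) -> Dict[str, float]:
    """Summarize how long a candidate stayed across completed roles."""
    if not tenures_months:
        raise ValueError("need at least one completed role")
    return {
        "mean": mean(tenures_months),
        "median": median(tenures_months),
        "spread": pstdev(tenures_months),  # population standard deviation
    }


def transition_density(n_roles: int, career_span_years: float) -> float:
    """Role changes per year of career (assumed definition)."""
    if n_roles < 1 or career_span_years <= 0:
        raise ValueError("need at least one role and a positive career span")
    return (n_roles - 1) / career_span_years


summary = tenure_summary([24, 24, 6])
density = transition_density(n_roles=4, career_span_years=6.0)
```

A candidate with four roles over six years has three transitions, so a density of 0.5 moves per year under this definition.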

The score is a directional signal, not a verdict. It does not tell a hiring team to hire or not hire a candidate. It provides a structured basis for calibrating onboarding investment, monitoring cadence, and early intervention — not for replacing the human judgment that belongs in any serious hiring process.

Honest limits of the validation

The validation establishes predictive signal, not certainty. Several important limitations:

  • The holdout cohort is n=51. This is a meaningful validation data point but not a large-scale epidemiological study. Additional validation is ongoing as outcome data accumulates.
  • The score captures career history pattern risk — not environmental factors, management quality, or post-hire conditions that also affect retention.
  • A high Stability Score does not guarantee retention. A lower score does not mean a hire will fail. Scores express probabilities that hold across a group of candidates, not certainties about any individual.
  • The N-1 methodology tests the system against prior career history only. It does not test prediction performance in real-time, concurrent hiring conditions.
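The n=51 caveat can be put in numbers with a confidence interval around the headline accuracy. The sketch below uses the Wilson score interval, a standard method for binomial proportions. It assumes 43 correct predictions out of 51, a count inferred from the published 84.3% rather than stated in the source, and it is not part of the validation study.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("require 0 <= successes <= n and n > 0")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# 43 of 51 is an assumed count consistent with the reported 84.3%.
low, high = wilson_interval(43, 51)
```

Under these assumptions the 95% interval runs from roughly 72% to 92%, which is why the cohort is described as a meaningful data point rather than a large-scale study.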

Frequently asked questions

What is the N-1 temporal holdout methodology?

The most recent completed role in a candidate's career history is withheld as the ground truth. The scoring system runs on prior career history only — what was visible before that last role started. The prediction is then compared to what actually happened in the withheld role.

What does 84.3% accuracy at 12 months mean?

Stability Engine correctly identified early-departure risk at the 12-month threshold for 84.3% of candidates in the validation cohort — both high-risk candidates who did depart, and lower-risk candidates who remained past 12 months.

What is a Brier score and what does 0.169 indicate?

The Brier score is the mean squared difference between a predicted probability and the observed outcome, on a 0-to-1 scale. 0 is perfect; higher is worse. For reference, a forecast that always says 50% scores 0.25. A score of 0.169 indicates that the stated probabilities of early departure carry real information and track observed departure rates reasonably closely.
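A short worked example shows how the scale behaves. This sketch uses the plain binary Brier score; the "Cox Brier score" reported here may be a time-dependent variant that accounts for censoring, which this simplified form does not model. The probabilities and outcomes below are made-up illustration data, not study results.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between predicted probability and 0/1 outcome."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


outcomes = [1, 0, 0, 1]  # 1 = departed within 12 months (illustrative)

perfect = brier_score([1.0, 0.0, 0.0, 1.0], outcomes)        # 0.0
uninformative = brier_score([0.5, 0.5, 0.5, 0.5], outcomes)  # 0.25
informative = brier_score([0.8, 0.2, 0.3, 0.6], outcomes)    # about 0.0825
```

Any forecast that scores below 0.25 is doing better than a constant 50% guess, which is the reference point for reading 0.169.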

Where can I read the full validation study?

The Ros Holdout Validation Study 2026 is available for download at stabilityengine.ai/audit. It covers methodology, cohort composition, result tables, failure mode transparency, and interpretation guidance.
