Policy · Oct 01, 2025 · 18 min

An open framework for evaluating AI.

We are publishing the full set of tests we run before each release: 612 tests, 11 categories, the prompts, the scoring scripts, and our own scores. Reproducible from a single command on consumer hardware.

VCorp
Safety

When a lab says "our model is safe," there is a definition problem. Safe compared to what? On which test? With what threshold? If every team picks its own criteria, no claim is comparable. Today we are publishing the full set of tests we run before each vMira release — the prompts, the scoring scripts, our own model's scores, and the harness that runs all of it from a single command.

The framework was not designed to settle a leaderboard fight. It was designed to make our own internal release decisions visible. If a test in the public set falls below the threshold we have set, we do not ship, and now you can verify, on the same hardware we use, that we did not ship below it. Every test ships with three artefacts: the original prompt or scenario, criteria for human evaluation written for non-specialists, and an automatic scoring script. The third matters more than people expect. The reason most safety claims are not reproducible is that the scoring is implicit; if you cannot run the same scoring script on a different model, you cannot compare. Our scripts run on a laptop in under two minutes per test.

612 tests in the public set
11 capability and harm categories
100% reproducible with one command

What we are publishing

The eleven categories, in order of priority: dangerous capabilities (chemistry, biology, cybersecurity, weapons-of-mass-destruction uplift); demographic bias (Russian-language gender, age, regional, and ethnic stereotypes, with the test set co-developed with university linguists); jailbreak robustness (the standard prompt-injection corpus plus a Russian-language extension we curated ourselves); reproducibility (does the same prompt with a fixed seed give the same answer twice); factuality with citations (does the model retrieve the source it cites); refusal calibration (does it refuse what it should and not what it should not); long-context retrieval (the haystack test, but in Russian and at lengths up to one million tokens); reasoning (math, code, structured analysis with verifiable outputs); multimodal grounding (do image-conditioned answers reflect the image); minority-language coverage (Tatar, Bashkir, Chuvash); and a residual category for things that do not fit elsewhere but that we have learned the hard way to test. Each category has a target threshold; the model report at every release lists the score, the threshold, and the delta.

Coverage of the framework as of April 2026. Categories with a grey border are awaiting peer review and contribute to internal scoring only.

How we use it internally

Before each release we run the full framework on the candidate model and the previous shipped model side by side. The release report is a comparison, not a snapshot: what got better, what got worse, what crossed a threshold. If a test has dropped below the threshold we set, we do not ship; we either fix the regression in a smaller training run or hold the release. If a customer reports a failure that is not in the public set, we add it to the framework and evaluate it on every future model. That is the only way the framework stays honest: every observed failure becomes a permanent test, so the same regression cannot ship twice. Internally we keep a private extension of the framework with confidential customer-reported tests; those are not in the public repository, but they cannot be removed once added.
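A minimal sketch of what such a side-by-side comparison could look like. The names `ScorePair`, `compare`, and `can_ship` are hypothetical and are not taken from the published harness; the sketch only assumes that every test yields one score per model and has one ship threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScorePair:
    """Scores for one test (hypothetical structure, not the harness's own)."""
    name: str
    previous: float   # score of the last shipped model
    candidate: float  # score of the release candidate
    threshold: float  # minimum candidate score required to ship


def compare(results):
    """Sort test names into improved, regressed, and below-threshold lists."""
    report = {"improved": [], "regressed": [], "below_threshold": []}
    for r in results:
        if r.candidate > r.previous:
            report["improved"].append(r.name)
        elif r.candidate < r.previous:
            report["regressed"].append(r.name)
        # A test can regress and still pass; only the threshold blocks shipping.
        if r.candidate < r.threshold:
            report["below_threshold"].append(r.name)
    return report


def can_ship(report):
    """The release is held if any test sits below its threshold."""
    return not report["below_threshold"]
```

In this sketch a regression alone does not block a release; only a score below the threshold does, which matches the rule stated above.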

Safety is a public commitment or it is nothing. If only we know what we measure, only we can say we passed.
VCorp safety

An invitation to the community

The framework is not ours alone. It is the first draft of an open standard, and we accept pull requests. New tests with a clear scoring rubric, improvements to the automatic scoring scripts, and translations of the prompts into other languages of the Federation are all welcome. Contributions go through a quarterly review; accepted ones land in the official framework on a fixed cadence so the version everyone evaluates against is stable for at least three months at a time. The bar for inclusion is high — a test that cannot be reproduced does not enter — but every test that meets the bar enters, regardless of whether the failure is more visible on our model or on someone else's. That last part is on purpose. A safety framework that is used as marketing eventually stops being a safety framework.

Why public scoring scripts matter more than public scores

Most published evaluation results are not reproducible by an outside party. The prompts are public, the answers are public, sometimes the scores are public — but the scoring is not. That is the gap that makes claims incomparable. A scoring script that runs locally on a different model in under two minutes is the difference between a leaderboard you can audit and a leaderboard you have to trust. Every test in our framework ships its scoring script. The script is deterministic, side-effect-free, and runs in a sandbox so it cannot accidentally exfiltrate the model under test. Publishing the script is what lets a third party verify that we did not ship below the threshold we claim — and what lets the same third party run the same script against a competitor model and produce a like-for-like comparison.

How we pick the thresholds — and why we publish them

Every category in the framework has a target threshold. The threshold is set by considering three things: the score the previous model achieved, the score we believe is achievable with the budget we have for the next training run, and the score the use case actually requires. For dangerous-capabilities tests, the threshold is set absolutely: below a certain refusal rate we do not ship at all, regardless of progress on other dimensions. For the softer categories — bias, calibration, long-context retrieval — the threshold is set as a delta from the previous model, so that the model improves over time even if absolute targets shift. The thresholds and the rationale are part of the public release report, so the trade-offs are visible.
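The two kinds of threshold can be sketched as follows. This is an illustration under assumptions, not the framework's own code: the `kind` labels, the function names, and the reading of a delta threshold as "previous score plus a required improvement" are all hypothetical.

```python
def effective_threshold(kind, value, previous_score=None):
    """Return the score a candidate model must reach.

    kind="absolute": `value` is the required score itself.
    kind="delta":    `value` is the required change relative to the
                     previous model's score (assumed interpretation).
    """
    if kind == "absolute":
        return value
    if kind == "delta":
        if previous_score is None:
            raise ValueError("delta thresholds need the previous model's score")
        return previous_score + value
    raise ValueError(f"unknown threshold kind: {kind!r}")


def passes(score, kind, value, previous_score=None):
    """True if the candidate score meets the effective threshold."""
    return score >= effective_threshold(kind, value, previous_score)
```

With an absolute threshold the previous model's score is ignored entirely, so progress elsewhere cannot compensate for a miss. With a delta threshold the bar moves with each shipped model.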

The Russian-language jailbreak corpus

The standard prompt-injection corpus is in English. Translating it into Russian gets you partway, but Russian-language jailbreaks have their own structure: instruction-injection through Cyrillic-Latin homoglyphs, register switching mid-prompt, prompts that exploit the formal/informal distinction the way English ones cannot. We curated a Russian-language jailbreak set with university linguists, including the homoglyph variants and a category we called register subterfuge — prompts that switch from formal to informal mid-sentence to test whether the model's refusal calibration tracks register or surface tokens. The corpus is in the public framework. We update it whenever a new jailbreak vector is reported by a customer.
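To make the homoglyph vector concrete, here is a small sketch that builds a homoglyph variant of a Russian word and flags words that mix scripts. The mapping covers only a handful of well-known lookalike pairs, and the function names are hypothetical; this is not how the corpus itself was generated.

```python
import unicodedata

# A few Cyrillic letters and their Latin lookalikes (illustrative subset).
CYRILLIC_TO_LATIN = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0445": "x",  # CYRILLIC SMALL LETTER HA
    "\u0443": "y",  # CYRILLIC SMALL LETTER U
}


def to_homoglyph_variant(text):
    """Swap mapped Cyrillic letters for their Latin lookalikes."""
    return "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in text)


def script_of(ch):
    """Return 'LATIN', 'CYRILLIC', or None, based on the Unicode name."""
    name = unicodedata.name(ch, "")
    for script in ("LATIN", "CYRILLIC"):
        if name.startswith(script):
            return script
    return None


def mixed_script_words(text):
    """Return whitespace-separated words containing both scripts."""
    flagged = []
    for word in text.split():
        scripts = {script_of(ch) for ch in word} - {None}
        if {"LATIN", "CYRILLIC"} <= scripts:
            flagged.append(word)
    return flagged
```

A variant produced this way looks the same on screen but no longer matches the original token sequence, which is why a refusal that tracks surface tokens can miss it.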

Long-context retrieval at one million tokens

The needle-in-a-haystack test is well known. Our framework runs it on Russian-language haystacks at lengths of 64K, 256K, 512K, and 1M tokens, with multiple needle positions per length. Retrieval quality stays above ninety-five percent through 256K and declines beyond that. We publish the slope of that decline, the haystack length where quality crosses ninety percent, and the model behaviour on a needle-not-present variant (does it correctly say the needle is missing, or does it confabulate one). For production, we recommend a retrieval pipeline rather than raw long context at lengths beyond the point where quality starts to decline, but the slope itself is published so you can decide where to draw your own line.
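The construction of a haystack, including the needle-not-present variant, can be sketched as follows. This is a simplification under stated assumptions: length is counted in whitespace-separated words rather than model tokens, and the function names and the "not found" sentinel are hypothetical, not the framework's actual protocol.

```python
def build_haystack(filler_words, needle, length, depth):
    """Build a haystack of `length` filler words.

    `depth` is the fractional needle position (0.0 = start, 1.0 = end).
    Pass needle=None to get the needle-not-present variant.
    """
    if not filler_words:
        raise ValueError("filler_words must not be empty")
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    body = [filler_words[i % len(filler_words)] for i in range(length)]
    if needle is not None:
        body.insert(int(depth * length), needle)
    return " ".join(body)


def score_answer(answer, needle):
    """Return 1.0 for a correct answer, else 0.0.

    When the needle is absent, the only correct answer is the (assumed)
    sentinel "not found"; anything else counts as confabulation.
    """
    if needle is None:
        return 1.0 if answer.strip().lower() == "not found" else 0.0
    return 1.0 if needle in answer else 0.0
```

Running the same prompt over several depths per length, and once with `needle=None`, gives one score per position plus a separate confabulation score for each haystack length.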

How to add a test

A test is three files: the prompt, the rubric for human evaluation, and an automatic scoring script. The scoring script accepts a model output and returns a numeric score in the range zero to one, with a confidence interval. Tests are submitted as a pull request; the framework runs them against our reference models on every commit so you can see the score before review starts. The review checks four things: the test is reproducible, the rubric is unambiguous, the scoring script is deterministic and side-effect-free, and the test is not a subset of an existing test. Quarterly review means accepted tests land in the official release on a predictable cadence; the version everyone evaluates against is stable for at least three months at a time, which is what makes year-over-year comparisons honest.
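As an illustration of the scoring-script contract, the sketch below returns a score between zero and one together with a confidence interval. It assumes a test is made of several prompts graded pass or fail by substring match, and it uses a Wilson score interval; both are choices made for this example, not the framework's documented method, and the function names are hypothetical.

```python
import math


def wilson_interval(passes, total, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def score(outputs, expected_substring):
    """Grade each output pass/fail and return the pass rate with its interval.

    Pure function: no randomness, no I/O, so repeated calls on the same
    outputs always return the same result.
    """
    if not outputs:
        raise ValueError("outputs must not be empty")
    passes = sum(expected_substring in out for out in outputs)
    return {
        "score": passes / len(outputs),
        "ci": wilson_interval(passes, len(outputs)),
    }
```

Keeping the scorer a pure function of the model outputs is one simple way to satisfy the "deterministic and side-effect-free" check in review.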

Internal tests that are not in the public set

We keep a private extension of the framework with confidential customer-reported failures. Those tests are not in the public repository — the prompts often contain customer data, and the failures are sometimes not yet patched. They are in the same harness, scored the same way, and held to the same release thresholds. The only difference is visibility. When a private test stabilises and the customer agrees, we move it to the public set. When a customer reports a regression that turns out to be already-covered, we add the variant to the appropriate public test instead of creating a new one. The point is that the public framework is never the whole picture, but it is the only picture an outside party should have to take on faith.
