ProductFeb 18, 202612 min

vMira 5.1: under the hood.

vMira 5.1 is our most capable Russian-language model. Released today across all our products: long context, image input, an optional reasoning mode, and a deployment footprint that runs from a hosted API to a single on-device server.

VC
VCorp
Leadership
VC
VCorp
Research

vMira 5.1, our most capable Russian-language model, is available today across all our products. It is faster, more accurate, and longer-context than vMira 5. This post covers what is new in this release, how it performs, and how to deploy it.

vMira 5.1 reads and writes fluent Russian and English, accepts images alongside text, and ships an optional reasoning mode that shows its work on math, code, and structured analysis. Native context is 262,144 tokens; with our context-extension method, requests can run to one million tokens for specialised workloads. Inference runs on our hosted API, in a private deployment inside the Russian Federation, or in a compact on-device build that runs on a single Linux server.

The reasoning model — vMira Thinking — ships at the same time and is described in detail in a companion post. The rest of this page covers the base model: what changed in this release, how it performs, where it is weak, and how to integrate it.

What is new in 5.1

vMira 5.1 is faster on the prompts that already worked, and meaningfully better on the prompts that did not. The largest gains are in three areas. **Russian-language reasoning** is materially stronger: with the optional reasoning mode, grade-school math accuracy is up 38%, competitive-programming accuracy is up 24%, and confidently-wrong answers on our factuality evaluation are down 61%. **Long-context behaviour** holds quality through 256,000 tokens and degrades gracefully beyond — we recommend a retrieval pipeline above that point for mission-critical work. **Image input** is now a first-class modality across the full context window: scans, photos of documents, diagrams, and handwritten notes all work alongside text. Deployment is unchanged in shape — hosted, private region, on-device — but every layer of the stack is faster.

Training

Training was a multi-stage process. The first stage built broad command of Russian and English on a large corpus weighted toward Russian-language press, fiction, technical documentation, statutes, code, and conversational data. The second stage was a supervised pass on a curated mix split four ways: reward-filtered instruction data; reasoning examples drawn from olympiad-grade math, long chain-of-thought traces, and structured analysis; targeted examples for the categories of public Russian-language evaluation where models historically under-perform; and Russian cultural and conversational data without translation artefacts. The third stage is alignment — a step that teaches the model **when** to stop reasoning rather than just **which** answer to prefer.

Long context

Native context is 262,144 tokens — long enough for a complete book, a multi-file repository, or a research paper with its references. With our context-extension method, requests can run to one million tokens for specialised workloads such as long-form contract analysis or full-codebase audits. Two practical notes for production. First, above roughly 256K tokens, retrieval quality starts to depend on where in the context the relevant information sits — for mission-critical work we recommend a retrieval pipeline rather than raw long context. Second, pricing scales linearly across context lengths; there is no premium tier above 256K.

Image and vision input

vMira 5.1 accepts images as part of the conversation: screenshots, scans, photos of documents, diagrams, charts, hand-written notes. The model describes what it sees, extracts Russian and English text, answers questions about the contents, and reasons about page layout. It does not generate images — that is what vMira Studio is for — and it does not accept video as a single input. For frame-based reasoning, extract frames first and pass them as an image sequence; the model treats them as ordered context.

vMira 5.1 inference layout: long-context multimodal model with optional reasoning, available as hosted API, in-region private deployment, or on-device build.

Deployment

Three deployment paths cover ninety-five percent of what teams actually need. The **hosted API** runs in our data centre with rolling capacity, billed per token in rubles or dollars. **Private deployment in the Russian Federation region** is available for enterprise customers under a separate service-level agreement, with primary collection of personal data inside Russia in line with Federal Law 152-FZ; inference can run in the same region or, where the data has been documented as transferable, in a foreign region for cost. The **on-premise build** is a compact package that runs on a single modern Linux server. All three paths support continuous batching and structured-output constraints. Operationally, a consumer GPU serves dozens of concurrent active users in steady state; server-class hardware in our hosted region serves the chat product to hundreds of concurrent users per card. Your numbers will depend on your prompt distribution and the fraction of requests that engage reasoning.

Where it falls short

We try to be honest about what the model can and cannot do. The headline limits we know about today, and what we recommend in their place:

If a model post only lists wins, it is marketing. We list the limits because those are the conversations our customers actually need to have.
VCorp leadership

For developers

API access is at our developer portal, with SDKs in Python, TypeScript, Go, and Rust. Four model identifiers are exposed: `mira`, `mira-thinking`, `mira-pro`, and `mira-max`. Reasoning is opt-in via the `reasoning.effort` parameter (low | medium | high). All requests are billed by token, never by time, so reasoning depth does not surprise you on the cost line. Streaming, tools, structured JSON output, file inputs, and web search are first-class on every model. Enterprise deployments include private endpoints, dedicated capacity, separate region pinning, and direct human support for incident response.

How 5.1 feels in everyday use

Long documents drop in cleanly: a hundred-page contract, a multi-chapter book, a research paper with appendices — all fit native context and are answered in seconds. Image input is the largest qualitative change. Where the previous model could read printed Russian text from screenshots, 5.1 reads handwriting, fills in document layout context, and reasons about diagrams that mix Russian and English. Voice in chat is more natural on the harder consonant clusters and on words with mobile stress, where older models systematically tripped. The reasoning model is opt-in via a single API parameter and is described in detail in a companion post.

Precision builds

We ship multiple precision builds. The hosted API runs in a high-precision configuration tuned for quality on the public categories of our evaluation framework. Private deployments and on-device builds use more compact builds, with the quality delta for each documented in the corresponding release report so customers can pick the trade-off their workload tolerates. For most teams, the on-device build is the right pick when data cannot leave the network and latency is not the binding constraint.

Concurrency on commodity hardware

For self-hosters, the practical question is how many concurrent active users a single GPU can serve. A modern consumer GPU serves dozens of concurrent active users in steady state, with a hard ceiling that depends on context length and reasoning effort. Server-class hardware scales this several times over; one card in our hosted region serves the chat product to hundreds of concurrent users. Your numbers will depend on your prompt distribution and the fraction of requests that engage reasoning, but they are reproducible on the same class of hardware we run.

What the API actually returns

Every chat call returns a structured response: `message.content` (the answer), optionally `reasoning.summary` and `reasoning.steps` (when reasoning was enabled), `usage` (input, output, and reasoning tokens, billed separately), and `model_meta` (the exact build hash, region, and runtime engine that served the request). The build hash is not cosmetic — if you log it with each request, you have a reproducible audit trail of which exact model produced which answer, useful for compliance and for narrowing down regressions when something changes between releases. We retain the build for six months past its successor's release, so a customer can pin to a known-good model while they validate the next.

On-device deployment: when and why

The on-device build is the path for customers whose data cannot leave the network. It runs on a modern Linux server with a single consumer GPU for production-grade throughput, or on a CPU-only host for lower-volume internal tools. CPU throughput is slow enough that you would not run a chat front-end off it, but fast enough for back-office automation, document analysis, and reflective work that a human reads. The on-device build has the same instruction-following profile as the hosted model, the same refusal calibration, and the same image-input head. It does not include web search or hosted tools; those have to be wired in by the customer to systems they control.

Did you find this article useful?