Shipping a product into a new country is exciting right up until you remember the manual has to go with it. The setup guide, the safety warnings, the messages on the screen, all of it needs to read like a local wrote it. And these days the obvious move is to hand the whole thing to an AI model and let it run.
We did exactly that on a recent project, and then we did something most teams skip. Instead of trusting one model, we ran the document past 22 of them and compared what came back. The gap between the best single output and the version the majority landed on was wider than we expected, and it changed how we localize technical documents now.
This is the full walkthrough: what we translated, how we did it step by step, and the numbers that came out the other side.
Why one AI model is a gamble on a technical document
Here is the uncomfortable part. The leading models are good, but they are not reliable in the way a manual needs them to be. Independent testing synthesized from Intento and the WMT24 findings puts the hallucination rate of individual top-tier large language models between 10% and 18% on translation tasks. On a casual email, a 10% slip is annoying. On a 40-page hardware manual with safety steps and interface strings, it is a recall waiting to happen.
The reason is structural, not something you can prompt your way out of. Language models are trained on probabilities rather than certainties, so a single model will confidently produce an output that sounds right and is quietly wrong. Different models fail in different places. One drops a negation. One invents a date format. One flattens a formal register into something too casual for a regulator. The recent Intento State of Translation Automation report makes the same point from the other direction: baseline systems averaged 10 to 15 errors per text, while workflows built to verify outputs cut that to nearly zero.
There is also a trust problem sitting underneath all of this. Stanford’s AI Index found that most people are now more nervous than excited about AI, and that nervousness has been rising year over year. If your own team does not fully trust the output, you end up re-reading every line by hand, which defeats the point of using AI at all.
What we were actually localizing
The document was a product manual for a connected device, headed into European markets in six languages at once: French, German, Italian, Spanish, Portuguese, and Polish. It mixed three kinds of content that each break in their own way. There was plain instructional prose. There were interface strings pulled straight from the firmware, complete with tags and placeholders that cannot move or the build breaks. And there were a handful of legal and safety lines where a wrong word is not a style choice but a liability.
Polish was the one we were most worried about, and for good reason. Single models hold up reasonably well on French, German, and Spanish, but they slide on morphologically complex languages. Our internal benchmarks, in line with Intento’s, show top single models plateauing around 84 to 87% on French, German, and Spanish, and dropping to roughly 76% on Polish.
The step-by-step
Step 1: Keep the document whole. We did not copy text into a chat box. We uploaded the file in its original format so the tags, tables, and layout stayed intact. Interface strings kept their placeholders, and the legal section kept its structure. Nothing had to be rebuilt by hand later.
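A quick way to confirm that placeholders and tags survived translation is to compare them between source and target. The sketch below is a minimal illustration, not the tooling we used: the token syntax (`{name}`, `%d`, `%1$s`, simple XML-like tags) and the function names are assumptions, and a real firmware string format may need a different pattern.

```python
import re
from collections import Counter

# Assumed placeholder/tag syntax: {name}, %s, %d, %1$s, and simple tags like <b> or </b>.
# Adjust the pattern to whatever your firmware strings actually use.
TOKEN_RE = re.compile(r"\{[A-Za-z0-9_]+\}|%(?:\d+\$)?[sd]|</?[A-Za-z][A-Za-z0-9]*\s*/?>")


def protected_tokens(text: str) -> Counter:
    """Count every placeholder and tag in a string."""
    return Counter(TOKEN_RE.findall(text))


def tokens_preserved(source: str, translation: str) -> bool:
    """True when the translation carries exactly the same tokens as the source."""
    return protected_tokens(source) == protected_tokens(translation)


source = "Press <b>{button_name}</b> for %d seconds."
good = "Appuyez sur <b>{button_name}</b> pendant %d secondes."
bad = "Appuyez sur <b>{nom_du_bouton}</b> pendant %d secondes."  # placeholder was translated

print(tokens_preserved(source, good))  # True
print(tokens_preserved(source, bad))   # False
```

The check compares counts rather than positions, because word order legitimately changes between languages while the set of tokens must not.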
Step 2: Run it past every model at once, not one at a time. Instead of picking a favorite model and hoping, we sent the content through 22 of them in parallel and lined the results up side by side. This is the idea behind SMART, the system inside MachineTranslation.com, an AI translator that compares the outputs of 22 AI models and selects the translation that most of them agree on.
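For teams wiring up something similar themselves, the fan-out can be sketched with a thread pool. This is an illustration under stated assumptions, not how SMART is implemented: `fan_out` and the three stub translators are hypothetical stand-ins for real model API calls.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# A translator takes (text, target_language) and returns a translation.
Translator = Callable[[str, str], str]


def fan_out(segment: str, target_lang: str, models: Dict[str, Translator]) -> Dict[str, str]:
    """Send one segment to every model in parallel and collect outputs by model name."""
    with ThreadPoolExecutor(max_workers=max(1, len(models))) as pool:
        futures = {name: pool.submit(fn, segment, target_lang) for name, fn in models.items()}
        return {name: future.result() for name, future in futures.items()}


# Hypothetical stubs standing in for real model calls.
def model_a(text: str, lang: str) -> str:
    return "Appuyez sur le bouton."


def model_b(text: str, lang: str) -> str:
    return "Appuyez sur le bouton."


def model_c(text: str, lang: str) -> str:
    return "Pressez la touche."


outputs = fan_out("Press the button.", "fr", {"a": model_a, "b": model_b, "c": model_c})
for name, translation in outputs.items():
    print(name, "->", translation)
```

Threads suit this case because each call mostly waits on a network response. In practice you would also want timeouts and error handling per model, which the sketch leaves out.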
Step 3: Let the majority decide, then study the disagreements. Where the models converged, we had high confidence. Where they split, the split itself was the signal. A line that 19 models render one way and 3 render another is a line worth a human eye. We were no longer guessing which model to trust. We were reading exactly where they failed to line up.
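The majority-and-disagreement logic can be sketched in a few lines. This is a simplified illustration, not the production system: it groups outputs by exact match after whitespace and case normalization, whereas a real comparison would likely use a similarity measure, and the 0.9 review threshold is an assumption chosen for the example.

```python
from collections import Counter

REVIEW_THRESHOLD = 0.9  # assumed cut-off; tune it to your own risk tolerance


def normalize(text: str) -> str:
    """Collapse whitespace and case so trivial differences do not split the vote."""
    return " ".join(text.split()).casefold()


def majority_pick(candidates: list[str]) -> tuple[str, float]:
    """Return the most common rendering and the share of models that produced it."""
    counts = Counter(normalize(c) for c in candidates)
    winner_key, votes = counts.most_common(1)[0]
    winner = next(c for c in candidates if normalize(c) == winner_key)
    return winner, votes / len(candidates)


# 19 models render the line one way, 3 render it another.
candidates = ["Nie otwierać obudowy."] * 19 + ["Nie otwieraj pokrywy."] * 3
winner, agreement = majority_pick(candidates)
needs_review = agreement < REVIEW_THRESHOLD

print(winner)               # Nie otwierać obudowy.
print(round(agreement, 3))  # 0.864
print(needs_review)         # True
```

With a 19-to-3 split the majority rendering still wins, but the agreement score falls below the threshold, so the line is flagged for a human.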
Step 4: Lock terminology across the whole document. A manual that calls the same button by three different names is worse than one with a typo. Requiring majority agreement across models filtered out the outlier word choices that cause that drift. In our internal benchmarks, this held terminology and register consistent at a rate above 96% across the full document set, against an industry baseline closer to 78% for single-model output at the same volume.
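Terminology drift is also easy to check after the fact. The sketch below is an assumption-laden illustration: `find_term_drift`, the candidate renderings, and the German sample sentences are all hypothetical, and it uses plain substring matching, which ignores inflection.

```python
from collections import defaultdict


def find_term_drift(pairs, candidates):
    """Report source terms that were rendered more than one way.

    pairs: list of (source_segment, translated_segment)
    candidates: source term -> renderings to look for in the translation
    """
    used = defaultdict(set)
    for source, target in pairs:
        for term, options in candidates.items():
            if term.lower() not in source.lower():
                continue
            for option in options:
                if option.lower() in target.lower():
                    used[term].add(option)
    return {term: sorted(found) for term, found in used.items() if len(found) > 1}


pairs = [
    ("Press the power button.", "Drücken Sie die Ein/Aus-Taste."),
    ("Hold the power button for 5 seconds.", "Halten Sie den Netzschalter 5 Sekunden lang gedrückt."),
    ("Release the power button.", "Lassen Sie die Ein/Aus-Taste los."),
]
candidates = {"power button": ["Ein/Aus-Taste", "Netzschalter", "Power-Taste"]}

drift = find_term_drift(pairs, candidates)
print(drift)  # {'power button': ['Ein/Aus-Taste', 'Netzschalter']}
```

Any term that shows up in the result has been translated inconsistently and should be pinned to one rendering before the document ships.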
Step 5: Send the high-stakes lines to a human. The safety and legal strings did not get to ride on AI confidence alone. We escalated them to a professional reviewer inside the same workflow, which is where the 100% accuracy guarantee comes from. The majority pick got us most of the way at scale, and the human sign-off made the last few lines certain.
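The escalation rule can be written down explicitly. This is a minimal sketch with hypothetical names: the category labels, the `Segment` fields, and the 0.9 agreement threshold are assumptions for illustration. The one point taken from the workflow above is that safety and legal lines go to a human no matter how strongly the models agree.

```python
from dataclasses import dataclass

HIGH_STAKES = {"safety", "legal"}  # assumed category labels
AGREEMENT_THRESHOLD = 0.9          # assumed cut-off for low-confidence lines


@dataclass
class Segment:
    text: str
    category: str     # e.g. "prose", "ui", "safety", "legal"
    agreement: float  # share of models that produced the majority rendering


def needs_human(segment: Segment) -> bool:
    """High-stakes lines always escalate; other lines escalate only on low agreement."""
    return segment.category in HIGH_STAKES or segment.agreement < AGREEMENT_THRESHOLD


segments = [
    Segment("Plug in the device.", "prose", 1.0),
    Segment("Do not open the housing.", "safety", 1.0),
    Segment("Tap {button_name} to pair.", "ui", 0.86),
]
review_queue = [s.text for s in segments if needs_human(s)]
print(review_queue)  # ['Do not open the housing.', 'Tap {button_name} to pair.']
```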
What the numbers looked like
Across the whole document, the multi-model approach cut critical error risk by around 90% compared with leaning on a single model, pulling the effective error rate to under 2%. The European languages that single models already handle well landed in the 93 to 95% band. Polish, the one we were braced for, climbed to roughly 88%, well clear of the roughly 76% where single models had left it.
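As a sanity check, the headline figures are consistent with each other. The arithmetic below assumes the 90% reduction applies to the 10% to 18% single-model error range quoted earlier, which is a simplification, since "critical error risk" and "hallucination rate" are not necessarily the same measurement.

```python
# Single-model error range quoted earlier in the article.
single_model_rates = (0.10, 0.18)
reduction = 0.90  # "around 90%" cut in critical error risk

# Effective error rate after the reduction, rounded to avoid float noise.
effective = tuple(round(rate * (1 - reduction), 4) for rate in single_model_rates)
print(effective)  # (0.01, 0.018)

# Polish: single-model plateau versus the multi-model result, in percentage points.
polish_gain = 88 - 76
print(polish_gain)  # 12
```

Both ends of the resulting range, 1% and 1.8%, sit under the 2% figure, and the Polish result works out to a gain of roughly 12 percentage points.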
“The question is not which AI model is best. On a long document, the question is which rendering 22 models converge on, because that is the one least likely to carry a silent error into a market you cannot read.”
Ofer Tirosh, CEO of Tomedes
The part you can reuse
You do not need our exact setup to apply the lesson. If you localize technical content, the principle travels:
- Never ship a document on the confidence of a single model. Compare several and watch where they disagree.
- Treat disagreement as a map. The lines models argue over are the lines that need a human.
- Protect structure first. Work in the original file so tags, placeholders, and layout survive.
- Reserve human review for the lines that carry real risk, not for the whole document.
That is the shift. Localizing a manual is not about finding the smartest model in the room. It is about refusing to bet everything on any one of them. The teams getting this right treat AI output as a set of opinions to be reconciled, which fits neatly with how the most useful applied AI tools are being built right now. And for a look at who is building the strongest of the underlying models, this engineering-team breakdown is worth a read.

Della Lovellerds writes the kind of smart device integration tactics content that people actually send to each other. Not because it's flashy or controversial, but because it's the sort of thing where you read it and immediately think of three people who need to see it. Della has a talent for identifying the questions that a lot of people have but haven't quite figured out how to articulate yet — and then answering them properly.
They cover a lot of ground: Smart Device Integration Tactics, Innovation Alerts, Tech Optimization Hacks, and plenty of adjacent territory that doesn't always get treated with the same seriousness. The consistency across all of it is a certain kind of respect for the reader. Della doesn't assume people are stupid, and they don't assume they know everything either. They write for someone who is genuinely trying to figure something out, because that's usually who's actually reading. That assumption shapes everything from how they structure an explanation to how much background they include before getting to the point.
Beyond the practical stuff, there's something in Della's writing that reflects a real investment in the subject: not performed enthusiasm, but the kind of sustained interest that produces insight over time. They have been paying attention to smart device integration tactics long enough that they notice things a more casual observer would miss. That depth shows up in the work in ways that are hard to fake.