Naming Food Across Six Languages: What We Learned

Every food name in Vnutri is translated into 6 languages: en, es, ca, fr, de, ru. That's about 5,000 unique strings (845 foods × 6 locales). When we started, it looked like a dictionary problem — find a glossary and run the catalog through a translator. Three months in, we had a different conclusion: translating food names isn't a dictionary problem.

Why straight translation fails

Three issues.

1. A name isn't a description. "Pollock" in English is a fish. In Russian, it's "минтай". Google Translate gives "поллок" (a phonetic transliteration), because it doesn't know the word names a fish. Knowing which fish requires category knowledge, not literal translation.

2. Varieties are regional. Apple Honeycrisp is everyday in the US. In Europe, it's rare. Apple Granny Smith — the opposite. When I say "apple" in English I'm implicitly thinking of an average American variety; "manzana" in Spanish means an average Spanish variety. Different physical produce with close but not identical nutrition.

3. Regional vocabulary. "Eggplant" (US) = "aubergine" (UK) = "berenjena" (ES) = "albergínia" (CA) = "aubergine" (FR) = "Aubergine" (DE) = "баклажан" (RU). Already two conventions within English. Same with "zucchini" / "courgette".

And the worst case — scientific names. "Salmo salar" is Atlantic salmon. Most translators just keep the Latin ("Anguilliformes" instead of "eel"), because they don't know what it refers to.

Three layers of localization

We built a pipeline of three layers, tried in order. The first is the cheapest and most trustworthy; each later layer covers rarer names at higher cost and with more risk of error.

Layer 1: OFF ingredients taxonomy

The Open Food Facts ingredients taxonomy is a curated multilingual food dictionary: 4,212 entries, with names in 100+ languages. ODbL-licensed.

Match rate: ~70 % of catalog foods land in OFF. First and cheapest pass.

Matching is tiered:

Direct (apple → apple) → take translation.
Singularized (apples → apple) → take translation.
2-token sorted subkey (black beans → beans + black) → take translation.
Reversed subkey (beans black → black beans) → take translation.
Head noun (chocolate dark → chocolate) → take translation (carefully).

On a match — we take ready-made names from OFF across all 6 locales. Quality is excellent: OFF is curated, false matches are essentially impossible.
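The tiers above can be sketched roughly as follows. This is a minimal illustration, not our production matcher: OFF_NAMES is a tiny hypothetical stand-in for the parsed taxonomy, singularize is deliberately naive, and the head-noun tier assumes catalog names put the noun first ("chocolate dark").

```python
from typing import Optional, Tuple

# Hypothetical stand-in for the parsed OFF taxonomy, keyed by English name.
OFF_NAMES = {
    "apple": {"es": "manzana", "fr": "pomme"},
    "black beans": {"es": "judías negras", "fr": "haricots noirs"},
    "virgin olive oil": {"es": "aceite de oliva virgen", "fr": "huile d'olive vierge"},
    "chocolate": {"es": "chocolate", "fr": "chocolat"},
}

# Order-insensitive index: sorted tokens -> canonical OFF name.
SORTED_INDEX = {" ".join(sorted(name.split())): name for name in OFF_NAMES}


def singularize(word: str) -> str:
    """Naive English singularizer; a real one needs an exceptions list."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def match_off(name: str) -> Optional[Tuple[str, str]]:
    """Return (tier, OFF name) for the first tier that matches, else None."""
    tokens = name.lower().split()
    if not tokens:
        return None

    key = " ".join(tokens)
    if key in OFF_NAMES:
        return "direct", key

    singular = " ".join(tokens[:-1] + [singularize(tokens[-1])])
    if singular in OFF_NAMES:
        return "singular", singular

    if len(tokens) == 2:
        sorted_key = " ".join(sorted(tokens))
        if sorted_key in SORTED_INDEX:
            return "sorted", SORTED_INDEX[sorted_key]

    if len(tokens) >= 2:
        reversed_key = " ".join(reversed(tokens))
        if reversed_key in OFF_NAMES:
            return "reversed", reversed_key
        # Least trustworthy tier: keep only the head noun.
        if tokens[0] in OFF_NAMES:
            return "head", tokens[0]

    return None
```

Returning the tier name along with the match makes it easy to review the risky tiers (especially "head") separately from the safe ones.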

Layer 2: Wikidata

For the remaining ~30 % we use Wikidata via the wbsearchentities API, with a tiered picker that filters on P31 (instance of) and other food signals:
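For orientation, here is a sketch of how a wbsearchentities request URL can be built. The endpoint and parameter names are the public MediaWiki API ones; the function name search_url and the default limit are our own choices for illustration.

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def search_url(term: str, language: str = "en", limit: int = 10) -> str:
    """Build a wbsearchentities URL for a food name."""
    params = {
        "action": "wbsearchentities",
        "search": term,
        "language": language,
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    return f"{WIKIDATA_API}?{urlencode(params)}"
```

Each hit in the response's "search" array carries an id, a label and a description. The search response does not include claims, so checking P31 takes a second lookup per candidate.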

Exact label/alias match + food signal in description/P31.
Substring overlap + a strong food keyword (vegetable, fish, meat).
P31 in a food whitelist (Q2095 food, Q3314483 cultivar, Q502163 fruit).
Description matches broad food regex.

Within a tier we sort by realTranslationCount — this filters out scientific binomials that have been copied unchanged into every language.
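A minimal sketch of that sort key, assuming each candidate carries a dict of labels by language. The candidate ids below are placeholders, not real Wikidata items.

```python
def real_translation_count(labels: dict[str, str]) -> int:
    """Count distinct label strings across languages, ignoring case."""
    return len({text.strip().casefold() for text in labels.values() if text.strip()})


# Placeholder candidates for "eel": one Latin-only item, one with real names.
candidates = [
    {"id": "Q-latin", "labels": {"en": "Anguilliformes", "es": "Anguilliformes",
                                 "fr": "Anguilliformes", "de": "Anguilliformes"}},
    {"id": "Q-common", "labels": {"en": "eel", "es": "anguila",
                                  "fr": "anguille", "de": "Aal"}},
]

# Candidates with more genuinely different labels come first.
ranked = sorted(candidates,
                key=lambda c: real_translation_count(c["labels"]),
                reverse=True)
```

The Latin-only item scores 1 however many languages repeat it, so it sinks to the bottom of its tier.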

Match rate: ~25 % additional foods. Results are cached. We stopped at around 476 foods, because Wikidata's aggressive rate limits for unauthenticated clients blocked further crawling.

Layer 3: Google Cloud Translation v3 (Translation LLM)

The final pass uses the general/translation-llm model in us-central1. This is not ordinary NMT (neural machine translation): the Translation LLM does better on food names that carry a state such as "cooked" or "raw".
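A sketch of what such a call can look like with the google-cloud-translate Python client. The helper names build_request and translate are ours, the project id is whatever your GCP project is, and the SDK import is kept inside translate so the request builder works without the package installed.

```python
def build_request(project_id: str, texts: list[str], target: str) -> dict:
    """Build a Cloud Translation v3 request that selects the Translation LLM."""
    parent = f"projects/{project_id}/locations/us-central1"
    return {
        "parent": parent,
        "contents": texts,
        "mime_type": "text/plain",
        "source_language_code": "en",
        "target_language_code": target,
        "model": f"{parent}/models/general/translation-llm",
    }


def translate(project_id: str, texts: list[str], target: str) -> list[str]:
    """Send the request; needs google-cloud-translate and GCP credentials."""
    from google.cloud import translate_v3  # lazy import, optional dependency

    client = translate_v3.TranslationServiceClient()
    response = client.translate_text(request=build_request(project_id, texts, target))
    return [item.translated_text for item in response.translations]
```

Keeping the request as plain data makes it easy to log exactly what was sent when a translation later turns out to be wrong.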

Why LLM beats NMT here:

Gender agreement. "cooked adzuki beans" → Spanish: "judías adzuki cocidas" ✓ feminine plural. NMT defaults to masculine "cocido", which is grammatically wrong. The LLM sees the noun in the same string and applies correct agreement.
Idiomatic vocabulary. "pollock" → Russian: NMT gives "поллок"; LLM gives "минтай". "cloudberry" → French: NMT echoes "cloudberry"; LLM gives "mûre des marais".

Cost — $80/M chars (vs $20 for NMT). For 5,000 × 30 characters × 5 non-EN locales ≈ $60 per full run. Cached per (phrase, lang).
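The arithmetic and the cache can be sketched like this. The figures are the ones quoted above; run_cost and cached_translate are illustrative names, and the in-memory dict stands in for whatever persistent store the cache actually lives in.

```python
from typing import Callable

LLM_PRICE = 80.0  # USD per million characters, as quoted above
NMT_PRICE = 20.0


def run_cost(strings: int, avg_chars: int, locales: int, price: float) -> float:
    """Estimated cost in USD of translating every string into every locale."""
    total_chars = strings * avg_chars * locales
    return total_chars / 1_000_000 * price


_cache: dict[tuple[str, str], str] = {}


def cached_translate(phrase: str, lang: str,
                     translate_fn: Callable[[str, str], str]) -> str:
    """Call translate_fn at most once per (phrase, lang) pair."""
    key = (phrase, lang)
    if key not in _cache:
        _cache[key] = translate_fn(phrase, lang)
    return _cache[key]
```

With the numbers above, run_cost(5_000, 30, 5, LLM_PRICE) covers 750,000 characters and comes to $60; with the cache in place, a rerun only pays for phrases that changed.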

After the LLM pass — a hardcoded fix table for 13 LLM hallucinations we caught with manual review.
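A fix table of this kind can be as simple as a lookup keyed by source phrase and locale. The three entries below are illustrative, built from the "fig" and "skate" cases described later in this post; they are not the actual 13-row table.

```python
# Illustrative entries only: (English phrase, locale) -> corrected name.
FIXES = {
    ("fig", "de"): "Feige",
    ("fig", "ru"): "инжир",
    ("skate", "ru"): "скат",
}


def apply_fixes(phrase: str, lang: str, llm_output: str) -> str:
    """Return the hand-checked name if one exists, else the LLM output."""
    return FIXES.get((phrase.lower(), lang), llm_output)
```

Keying on the source phrase rather than on the bad output means the fix keeps working even if the model's wrong answer changes between runs.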

Concrete pitfalls

What we learned the hard way.

Brand names. "Blackberry" in English is a berry. The LLM sometimes translates it to "BlackBerry" (the phone) in Spanish. Fix: a hardcoded brand blocklist.

Abbreviations. "fig" (the fruit) — the LLM in German gave "Abb." and in Russian "рис." (thinking it was "figure" in a table). The word was out of context, and the LLM picked the statistically common reading "fig." (figure abbreviation).

Passive-aggressive binomials. Wikidata often stores Latin names as primary labels ("Anguilliformes" for "eel"). If those binomials are copied as "translations", 100+ languages end up with the same Latin string — looking like "lots of translations" but in fact one string. A filter on realTranslationCount (count of distinct strings across languages) removes those cases.

Verb/noun flips. "skate" in English is a fish (a ray). The LLM in Russian gave "кататься на коньках" (to skate). The context "100 g skate" didn't help — the LLM picked the more frequent reading.

Sushi effect. Japanese words borrowed by most European languages: "tofu", "miso", "sushi", "edamame". They keep the same word in all 6 locales (written in Cyrillic for Russian, e.g. "суши"). We don't translate them.

State suffix

A separate problem is state. "apple, cooked" → Russian should be "варёное яблоко" (neuter gender agreement). The LLM handles this, but only if the comma-suffix is transformed into an adjective-prefix before sending: "apple, cooked" → "cooked apple" → "варёное яблоко".

Otherwise the LLM translates the comma form literally as "яблоко, готовое", a calque from English and grammatically awkward in Russian.

This is a hardcoded transform before the LLM call. Comma-form → natural noun phrase → LLM → natural target noun phrase.
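A minimal sketch of that transform, assuming the state is always the last comma-separated part and comes from a known list. The STATES set here is a hypothetical sample, not the full list we use.

```python
# Hypothetical sample of recognized state words.
STATES = {"cooked", "raw", "boiled", "fried", "dried", "baked"}


def naturalize(name: str) -> str:
    """Turn "apple, cooked" into "cooked apple"; leave other names untouched."""
    head, sep, suffix = name.rpartition(",")
    state = suffix.strip().lower()
    if sep and head.strip() and state in STATES:
        return f"{state} {head.strip()}"
    return name
```

Names without a comma, or with a suffix that isn't a known state ("salt, iodized"), pass through unchanged, so the transform is safe to run over the whole catalog.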

Word order by locale

English puts the adjective before the noun: "red apple". French puts it after: "pomme rouge". Spanish usually puts it after ("manzana roja"), sometimes before ("buena manzana"). Catalan almost always puts it after.

Direct translation keeps the source-language order. That gives awkward output for long compound names: "raw black beans" → "judías negras crudas" (correct) vs "crudas judías negras" (wrong).

A final reorder-pass (Claude Sonnet, cached) walks each locale and rewrites names into a noun-first form. It's its own pipeline stage.

We don't tag halal/kosher

These are about process certification, not the product itself. "Kosher cheese" isn't the same cheese as ordinary cheese (production requirements differ). Vnutri doesn't tag products as halal/kosher — it's not a product property, it's a process property. See 9 diets explained.

Coverage

After the three pipeline layers:

en — 100 % (canonical)
es — 95 %
ca — 90 %
fr — 94 %
de — 92 %
ru — 89 %

The remainder are names that simply don't exist in OFF/Wikidata and for which the LLM produced something incorrect. They get hand-fixed in name-overrides.json.

What we don't do

No brand translation. If a name contains a brand, it stays as-is.
No LLM freedom. Every translation is checked against OFF/Wikidata where we have a data point.
No NMT. Only Translation LLM or curated taxonomy.

Open problems

Regional varieties. "Apple" — the catalog has one averaged apple. A Russian user thinks of antonovka or gala; an American thinks of Honeycrisp or Red Delicious. Different physical produce with close but not identical nutrition.

Local dishes. Pelmeni, borscht, kasha — words that exist in each language but refer to slightly different physical dishes. Vnutri usually uses the source-language name with an English transliteration as the anchor.

Transliteration vs translation. "Sushi" in German is "Sushi". "Sushi" in Russian is "суши". "Pizza" in German is "Pizza". "Pizza" in Russian is "пицца". When to transliterate vs translate — no hard rule, we follow convention.

Catalog architecture — where our data comes from. Implementation details — in the apps/scripts/ directory of the repo.

References

Open Food Facts. Ingredients Taxonomy. https://github.com/openfoodfacts/openfoodfacts-server/tree/main/taxonomies/food
Wikimedia Foundation. Wikidata Query Service. https://query.wikidata.org/
Google Cloud. Cloud Translation API v3 documentation. 2024.
Wood AJ, Lengyel P, et al. Multilingual neural machine translation: research and product. Trans Assoc Comput Linguist. 2020.