Methodology · 8 min read

Where Vnutri's Nutrition Data Comes From: 8 Databases, One Catalog

Eight open food-composition databases, how we merge them into a single 845+ food catalog with 38 nutrients each, and the methodological trade-offs behind it.

[Figure: Map of 8 food composition sources: USDA, Ciqual, CoFID, AFCD, Frida, Matvaretabellen, CNF]

"How many calories are in an apple?" is a simple question with a not-so-simple answer. It depends on the variety, ripeness, origin, and who measured it. Numbers on labels, in textbooks, and in apps can differ by a factor of 1.5 to 2. This article explains how Vnutri tries to close that gap.

The Vnutri catalog holds 845+ everyday foods and 340 dishes, each with 38 nutrients. Underneath: 8 curated, openly licensed food-composition databases merged into a single set. Here's what they are, why we picked them, and how we combine them.

What is a food composition database?

A food composition database is a collection of lab-measured nutrient data for foods. Each record is a specific product (e.g. "Apple, raw, with skin, including foods for USDA's Food Distribution Program") with dozens to hundreds of nutrient values, expressed per 100 g.

These databases are built by national institutions: USDA in the US, ANSES in France, FSANZ in Australia, and so on. The data come from instrumental analysis (HPLC for vitamins, ICP-MS for minerals, gas chromatography for fatty acids) of regularly sampled foods. It's expensive: a typical record costs labs several thousand dollars.

That's why most of these databases are public — they were built with taxpayer money, and governments require them to be open.

Vnutri's eight sources

Source | Records | License | Region
USDA FoodData Central (Foundation + SR Legacy) | 7,928 | Public domain | US
USDA FNDDS 2021–2023 (mixed dishes) | 5,431 | Public domain | US
Canadian Nutrient File | 5,690 | OGL Canada | Canada
UK CoFID (McCance & Widdowson 2021) | 2,636 | OGL v3.0 | UK
ANSES Ciqual 2020 | 2,298 | Etalab | France
Matvaretabellen | 2,118 | NLOD | Norway
AFCD (FSANZ Release 3) | 1,588 | CC BY 3.0 AU | Australia
Frida (DTU) | 1,381 | Open | Denmark
USDA Choline DB | ~25 | Public domain | US

About 29,000 source records before filtering.

Why these specifically

Three selection criteria.

  1. Open license. Public domain, CC, OGL, NLOD. No "personal use only" terms, because we're building a commercial product. That rules out closed databases like NEVO (Netherlands) and some university databases.
  2. Regular updates. Databases with numbered releases (for example "Release 18") that are refreshed every 3–5 years. That excludes archived or one-off projects.
  3. Lab measurement, not recipe calculation. We chose sources where most records come from actual instrumental analysis. USDA FNDDS is used only for dishes (its values are recipe-calculated, and USDA says so explicitly).

What's not included, and why:

  • Open Food Facts (OFF): user-contributed, ODbL. Too noisy: branded products, garbled names, no validation. We use OFF only for localization (a multilingual food name dictionary), not for nutrient values.
  • Fineli (Finland): a great database, CC BY 4.0, but command-line access is blocked and we haven't done the manual export yet.
  • Livsmedelsverket (Sweden): CC0, but it needs a manual export, which is on hold.
  • NEVO (Netherlands): closed license.

Which nutrients

38 nutrients per food:

Energy and macros (10): calories, protein, fat, carbohydrate, fiber, sugars, starch, saturated/mono/poly/trans fats, cholesterol.

Minerals (10): calcium, iron, magnesium, phosphorus, potassium, sodium, zinc, copper, selenium, manganese, iodine.

Vitamins (13): A, retinol, D, E, K, C, B1 (thiamine), B2 (riboflavin), B3 (niacin), B5 (pantothenate), B6, B9 (folate), B12.

Fatty acids (3): omega-3, omega-6, plus the per-fat breakdown.

Other (3): choline, lactose, glycemic index (where available).

That's more than any single source publishes: USDA SR Legacy doesn't list iodine, USDA FDC doesn't list choline, Ciqual doesn't list selenium, and so on. Each nutrient is sourced from the databases that actually publish it.

How we merge sources

Simple averaging doesn't work. Different databases sample different varieties and regions and use different analytical methods. The same "apple" in USDA and in Ciqual is physically different produce.

So Vnutri clusters records from all sources into groups (food × variety × state) and then takes a weighted median per nutrient inside each group.

Concretely:

  1. Name normalization. Strip category suffixes ("raw, with skin, includes…"), apply synonyms (yoghurt → yogurt, aubergine → eggplant), normalize formatting.
  2. Clustering. Sorted tokens + state (raw / cooked / dried) → cluster key. "Black beans, cooked" from USDA and "Beans, black, cooked" from CoFID land in the same cluster.
  3. Sanity check. Atwater check: predicted calories (protein × 4 + fat × 9 + carbs × 4) should match the reported value within ±25 %. Records outside that range get dropped — usually a data-entry error.
  4. Weighted median. USDA Foundation, Ciqual, CoFID, CNF, Frida, AFCD — weight 3. USDA SR Legacy, Matvaretabellen — weight 2. Median, not mean, so a single outlier doesn't pull the result.
  5. Minimum sources. A cluster needs data from at least 2 sources. Single-source anomalies are dropped.
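
Steps 1 and 2 can be sketched roughly as follows. This is a minimal illustration, not Vnutri's actual code: the function name `cluster_key`, the key format, and the `SYNONYMS`, `STATES`, and `DESCRIPTORS` lists are assumptions made up for the example, and the tokenizer only handles ASCII letters.

```python
import re

# Illustrative vocabularies; the article does not publish the real lists.
SYNONYMS = {"yoghurt": "yogurt", "aubergine": "eggplant"}
STATES = ("raw", "cooked", "dried")
DESCRIPTORS = {"with", "skin", "includes", "including"}


def cluster_key(name: str) -> str:
    """Build an order-independent key of the form '<sorted tokens>|<state>'."""
    # Lowercase, split on anything that is not an ASCII letter, map synonyms.
    tokens = [SYNONYMS.get(t, t) for t in re.findall(r"[a-z]+", name.lower())]
    # The first state word found wins; names without one get 'unknown'.
    state = next((t for t in tokens if t in STATES), "unknown")
    # Drop state and descriptor words, then sort so word order no longer matters.
    words = sorted({t for t in tokens if t not in STATES and t not in DESCRIPTORS})
    return " ".join(words) + "|" + state


print(cluster_key("Black beans, cooked"))   # beans black|cooked
print(cluster_key("Beans, black, cooked"))  # beans black|cooked
```

Sorting the tokens is what lets differently ordered names from two databases collapse into one key. A production version would need a much larger synonym list and handling for plurals.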

The result: one record per food with the best available data per nutrient. The sources are listed on each food's detail page.
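
Steps 3 to 5 might look like the sketch below for a single nutrient (calories). The weights and the ±25 % tolerance come from the list above. Everything else is an assumption for illustration: the `Record` type, the short source labels, the default weight of 1 for unlisted sources, the choice of the lower weighted median, and the sample numbers, which are made up rather than taken from any database.

```python
from dataclasses import dataclass


@dataclass
class Record:
    source: str
    kcal: float
    protein: float
    fat: float
    carbs: float


# Weights from step 4; the source labels are hypothetical identifiers.
WEIGHTS = {
    "usda_foundation": 3, "ciqual": 3, "cofid": 3, "cnf": 3, "frida": 3, "afcd": 3,
    "usda_sr_legacy": 2, "matvaretabellen": 2,
}


def passes_atwater(r: Record, tolerance: float = 0.25) -> bool:
    """True if predicted calories are within the tolerance of the reported value."""
    predicted = r.protein * 4 + r.fat * 9 + r.carbs * 4
    return abs(predicted - r.kcal) <= tolerance * r.kcal


def weighted_median(pairs):
    """Lower weighted median of (value, weight) pairs."""
    if not pairs:
        raise ValueError("no values to aggregate")
    total = sum(weight for _, weight in pairs)
    running = 0
    for value, weight in sorted(pairs):
        running += weight
        if running * 2 >= total:
            return value


def merge_kcal(records, min_sources: int = 2):
    """Drop Atwater failures, require enough distinct sources, take the median."""
    kept = [r for r in records if passes_atwater(r)]
    if len({r.source for r in kept}) < min_sources:
        return None  # single-source clusters are dropped
    return weighted_median([(r.kcal, WEIGHTS.get(r.source, 1)) for r in kept])


# Made-up records for one cluster; the third has a misplaced decimal point.
records = [
    Record("usda_foundation", 52.0, 0.3, 0.2, 13.8),
    Record("ciqual", 54.0, 0.3, 0.2, 12.0),
    Record("matvaretabellen", 520.0, 0.3, 0.2, 13.0),
    Record("usda_sr_legacy", 50.0, 0.3, 0.2, 12.5),
]
print(merge_kcal(records))  # 52.0
```

The 520 kcal record fails the Atwater check and never reaches the median, so the result is driven by the three plausible records and their weights.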

Nutrient coverage

Not every nutrient is measured with equal care. Coverage across 845 foods:

Coverage | Nutrients
100 % | Calories, protein, fat, carbohydrate
90–95 % | Fiber, calcium, iron, sodium, potassium, magnesium, phosphorus, niacin, A, C, B1, B2, B6, zinc, copper, folate, B12
85–90 % | Sugars, cholesterol, sat/mono/poly fats, manganese, selenium, D
75–85 % | Pantothenate, E, omega-3
60–75 % | Omega-6, starch, trans fats, K
50–60 % | Choline, iodine

Iodine and choline are stuck at 50–60 % because of source limits: USDA SR Legacy doesn't report iodine (the column exists but isn't filled), and choline only lives in USDA and CNF.
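
For reference, a coverage figure like the ones above is just the share of foods with a non-missing value for a nutrient. The sketch below assumes each food is a dict with `None` for missing values; the field names and the sample values are made up for the example.

```python
def coverage_percent(foods, nutrient: str) -> float:
    """Percentage of foods that have a value for the given nutrient."""
    if not foods:
        return 0.0
    filled = sum(1 for food in foods if food.get(nutrient) is not None)
    return 100.0 * filled / len(foods)


# Hypothetical mini-catalog; the values are placeholders, not catalog data.
foods = [
    {"name": "food A", "kcal": 50, "iodine_ug": None},
    {"name": "food B", "kcal": 90, "iodine_ug": 2.0},
    {"name": "food C", "kcal": 130, "iodine_ug": None},
    {"name": "food D", "kcal": 60, "iodine_ug": 35.0},
]
print(coverage_percent(foods, "kcal"))       # 100.0
print(coverage_percent(foods, "iodine_ug"))  # 50.0
```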

What about dishes

About 340 dishes in the catalog. Around 150 come from USDA FNDDS 2021–2023 — a government dataset of dishes with recipe-calculated values. The rest (~190) are regional dishes with no FNDDS analog: borscht, pelmeni, bibimbap, dal, pho, jollof rice, and so on. Their nutrient values are LLM-estimated by Claude Opus based on a typical recipe and lab-measured ingredients from the main catalog.

These dishes carry an "approximate" badge on their detail page — their nutrient profile is an LLM estimate, not lab data. Accuracy here is meaningfully lower than for single-ingredient foods.

Name localization

Food names in the catalog are translated into 6 languages (en, es, ca, fr, de, ru). The translation pipeline has three layers, cheapest first.

  1. OFF taxonomy — a curated multilingual food vocabulary, 4,212 entries across 100+ languages. From the Open Food Facts ingredients taxonomy on GitHub. Match rate ~70 %.
  2. Wikidata wbsearchentities API for rare or regional foods. A tiered picker filters candidates by P31 (instance of) to keep food items.
  3. Google Cloud Translation v3 (Translation LLM) — the final pass. Re-translates everything to fix scientific-name leakage and apply gender/number agreement.
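
The first two layers behave like a "cheapest first" lookup chain, which can be sketched as below. The `wbsearchentities` action and its `search`, `language`, `type`, `limit`, and `format` parameters are part of the public Wikidata API. The helper names, the stub lookups, and the way layers are chained are assumptions for illustration. The sketch makes no network calls, and it leaves out both the P31 filtering and the final re-translation pass.

```python
from typing import Callable, Optional, Sequence
from urllib.parse import urlencode

# A lookup takes (name, target language) and returns a translation or None.
Lookup = Callable[[str, str], Optional[str]]


def wikidata_search_url(term: str, lang: str, limit: int = 5) -> str:
    """Build a wbsearchentities request URL (the request itself is not sent)."""
    params = {
        "action": "wbsearchentities",
        "search": term,
        "language": lang,
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    return "https://www.wikidata.org/w/api.php?" + urlencode(params)


def first_hit(name: str, lang: str, lookups: Sequence[Lookup]) -> Optional[str]:
    """Try each layer in order and return the first non-empty translation."""
    for lookup in lookups:
        hit = lookup(name, lang)
        if hit:
            return hit
    return None


# Stub layers standing in for the OFF taxonomy and a Wikidata-backed picker.
off_entries = {("apple", "fr"): "pomme"}
off_lookup: Lookup = lambda name, lang: off_entries.get((name, lang))
wikidata_stub: Lookup = lambda name, lang: None

print(first_hit("apple", "fr", [off_lookup, wikidata_stub]))  # pomme
print(wikidata_search_url("pelmeni", "en"))
```

Ordering the layers by cost means the paid or rate-limited services are only reached for names the local dictionary cannot resolve.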

More — how we name food across 6 languages.

What about states (raw vs cooked)

The same food raw and cooked is nutritionally two different products. Cooked rice has absorbed water, so it has less protein and fewer calories per 100 g than raw rice. Cooked spinach has lost water, so many minerals are more concentrated per 100 g than in raw spinach.
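
The water effect can be illustrated with the standard yield-and-retention calculation. This only shows why per-100 g values shift; it is not a description of how Vnutri obtains its cooked records. The function name and the round numbers in the example are hypothetical.

```python
def per_100g_after_cooking(raw_per_100g: float, yield_factor: float,
                           retention: float = 1.0) -> float:
    """Estimate a per-100 g value after cooking.

    yield_factor = cooked weight / raw weight (above 1 when the food absorbs
    water, below 1 when it loses water); retention = fraction of the nutrient
    that survives cooking.
    """
    if yield_factor <= 0:
        raise ValueError("yield_factor must be positive")
    return raw_per_100g * retention / yield_factor


# Hypothetical round numbers, chosen only to show the direction of the effect.
print(per_100g_after_cooking(360.0, yield_factor=2.5))  # 144.0 (food gains water)
print(per_100g_after_cooking(3.0, yield_factor=0.5))    # 6.0 (food loses water)
```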

Vnutri handles this with a state-variant model: every food has a state (raw, cooked, dried, baked, etc.) and a groupId shared by all states of the same food. The list view shows one primary (usually raw); the detail view has a state switcher.
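
One way to picture the state-variant model is the sketch below. The `state` and group fields mirror the description above (`group_id` stands in for `groupId`). The `FoodVariant` type, the `STATE_PRIORITY` order, and the sample entries are assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FoodVariant:
    id: str
    group_id: str  # shared by all states of the same food
    state: str     # raw, cooked, dried, baked, ...
    name: str


# Assumed preference order for the list view: raw first when it exists.
STATE_PRIORITY = ["raw", "cooked", "baked", "dried"]


def primary_per_group(variants):
    """Pick one variant per group for the list view."""
    groups = defaultdict(list)
    for variant in variants:
        groups[variant.group_id].append(variant)

    def rank(variant):
        # Unknown states sort after every known state.
        if variant.state in STATE_PRIORITY:
            return STATE_PRIORITY.index(variant.state)
        return len(STATE_PRIORITY)

    return {gid: min(members, key=rank) for gid, members in groups.items()}


variants = [
    FoodVariant("chicken-cooked", "chicken", "cooked", "Chicken, cooked"),
    FoodVariant("chicken-raw", "chicken", "raw", "Chicken, raw"),
    FoodVariant("raisins-dried", "raisins", "dried", "Raisins"),
]
primary = primary_per_group(variants)
print(primary["chicken"].id)  # chicken-raw
print(primary["raisins"].id)  # raisins-dried
```

The detail view's state switcher would then list every variant that shares the selected food's group.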

More — why "cooked chicken" and "raw chicken" are different foods.

Glycemic index

GI is the only value in Vnutri that doesn't come from the 8 food databases. Source: Atkinson et al., International Tables of Glycemic Index and Glycemic Load Values 2021 (Am J Clin Nutr), a systematic review. It's the most complete systematic compilation of GI to date.

Not every food has a measured GI — only carb-containing foods, and only when at least one lab session has been documented in the literature. About 30 % of the catalog carries a GI. See the glycemic index.

What we don't do

  • No paid closed databases. Open license only.
  • No user-contributed data for nutrients. OFF is used for names only.
  • No trust in product label data. Declared calories on packaging can differ from lab values by up to 20 % (FDA tolerance). Lab data is more accurate.
  • No nutrient-by-recipe summing, except for "mixed dishes" from FNDDS.

Accuracy and limits

What to expect.

Data accuracy: for single-ingredient foods — lab-measured values. For FNDDS dishes — recipe-calculated values. For regional dishes — LLM estimates.

Regional variation: an apple in the US, Norway, and Australia is physically different produce. Our median smooths regional effects. If you're analyzing food from a specific region, a local database may be more accurate.

Varietal variation: Honeycrisp ≠ Granny Smith on sugar and acidity. In the catalog, "apple" is the median across varieties. Specific varieties have to be looked up separately.

Cooking: "boiled potato" in the catalog is averaged across boiling methods. Roasted or fried — a different profile.

Data aging: USDA SR Legacy data is from 2018; CoFID 2021; Ciqual 2020. Very recent foods (yacón, lab-grown meat) don't appear quickly.

Attribution

All sources are cited on the Acknowledgments page with their licenses. Every nutrient in a food's detail card can be traced to the source.

If you want to use Vnutri data in your own project, we're open to discussing it. Reach out: hello@vnutri.app.

References

  • US Department of Agriculture. FoodData Central. 2024.
  • Health Canada. Canadian Nutrient File. 2023.
  • Public Health England. McCance and Widdowson's The Composition of Foods Integrated Dataset 2021.
  • ANSES. Ciqual French food composition table. 2020.
  • Norwegian Food Safety Authority. Matvaretabellen. 2023.
  • Food Standards Australia New Zealand. Australian Food Composition Database, Release 3. 2024.
  • Technical University of Denmark. Frida Food Database. 2023.
  • Atkinson FS, Brand-Miller JC, Foster-Powell K, et al. International tables of glycemic index and glycemic load values 2021. Am J Clin Nutr. 2021;114(5):1625–1632.