The American journal of clinical nutrition

Humans, Diet Records, Taiwan, Male, Female

Customized multimodal Diabot-GPT-4o enhances accuracy of image-based dietary assessments in dietetic trainees in Taiwan: validation against weighed food records.

Yu Jie Chen, Chun-Chao Chang, Yen Nhi Hoang, Annie W Lin, Wen-Ling Lin, Cheng-Yu Lin, Ellyn Patricia, Janice Clarisa Tissadharma, Jovan Kuanca, Natasya Nobelta, Kimberly Alecia Theo, Dang Khanh Ngan Ho, Pin-Hui Wei, Jung-Su Chang

Published: 2025

DOI: 10.1016/j.ajcnut.2025.10.013

Abstract

Open Access

BACKGROUND: Automated image-based dietary assessments (IBDAs) using multimodal artificial intelligence (AI) chatbots show strong potential. However, sources of error at the human-AI interface in real-world use remain unclear.

OBJECTIVES: In this study, we validated a GPT-4o-powered chatbot for automated IBDAs and identified key sources of error in free-living settings.

METHODS: In total, 714 food images were collected from 3-d weighed food records (WFRs) across 171 d from 57 young adults. Images were analyzed using 4 AI configurations: Diabot (DB), DBFN (customized GPT-4o), 4o, and 4oFN (noncustomized), where "FN" indicates inclusion of the food name input. Portion sizes and nutrient estimates were compared with WFRs using Bland-Altman plots with equivalence testing at ±10%, ±15%, and ±20% bounds.

RESULTS: Using images alone, DB recognized 74% of food items versus 59% for 4o. All AI configurations provided accurate estimates of portion sizes (±10%-15%, coefficient of variation [CV]: 13%), energy (±10%-20%, CV: 14%), and carbohydrates (CHOs; ±15%-20%, CV: 15%) but showed less consistency for fats (±10%-22%, CV: 24%) and proteins (±10% to >20.2%, CV: 18%). The custom DBFN outperformed 4oFN, achieving higher accuracy across more nutrients within the ±10% (weight, energy, fats, saturated fats, potassium, and magnesium), ±15% (proteins and sodium), and ±20% (CHOs and calcium) bounds and achieved the highest agreement with WFRs (Spearman's ρ = 0.863-0.662; Lin's concordance correlation coefficient = 0.874-0.540). Common errors at the human-AI interface included inaccurate portion-size estimates, obscured food visibility in images, poorly constructed prompts, omission or intrusion errors, and system-specific limitations, such as processing overload and configuration inconsistencies.

CONCLUSIONS: Customized AI chatbots improved automated IBDAs, yet accuracy depends on clear images for food visibility and portion-size fidelity. Standardized AI-input procedures (FN, cooking state, prompt structure, and configuration) and expert oversight to detect and correct AI hallucinations (fabricated items, units, or quantities) remain essential for reliable, interpretable estimates.
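
For readers unfamiliar with the agreement statistics named above, the following is a minimal Python sketch of two of them: the Bland-Altman bias with 95% limits of agreement, and Lin's concordance correlation coefficient. It is not the authors' analysis code. The function names and the five paired energy values are hypothetical and chosen only for illustration, and the equivalence testing at the ±10%, ±15%, and ±20% bounds is not reproduced here because the abstract does not specify the procedure.

```python
from statistics import mean, pvariance, stdev


def bland_altman(estimates, references):
    """Return (bias, lower limit, upper limit) of agreement for paired values."""
    diffs = [e - r for e, r in zip(estimates, references)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)  # 1.96 x sample SD of the paired differences
    return bias, bias - spread, bias + spread


def lin_ccc(x, y):
    """Lin's concordance correlation coefficient, using population moments."""
    mx, my = mean(x), mean(y)
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return 2 * cov / (pvariance(x) + pvariance(y) + (mx - my) ** 2)


# Hypothetical energy values (kcal) for five meals; NOT data from the study.
weighed = [520.0, 610.0, 430.0, 700.0, 560.0]
ai_estimate = [500.0, 640.0, 450.0, 660.0, 570.0]

bias, lower, upper = bland_altman(ai_estimate, weighed)
print(f"bias = {bias:.1f} kcal, limits of agreement = ({lower:.1f}, {upper:.1f})")
print(f"Lin's CCC = {lin_ccc(ai_estimate, weighed):.3f}")
```

Unlike a rank correlation such as Spearman's ρ, the concordance coefficient penalizes a systematic offset between the two methods through the squared difference of means in its denominator, which is presumably why the abstract reports both.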