Effectiveness of ChatGPT, Google Gemini, and Microsoft Copilot in Answering Thai Drug Information Queries: Cross-Sectional Study.
Suphannika Pornwattanakavee, Nattawut Leelakanok, Teerarat Todsarot, Gabrielle Angele Tatta Guinto, Ratchanon Takun, Assadawut Sumativit, Marisa Senngam
Abstract
BACKGROUND: ChatGPT-4o, Google Gemini, and Microsoft Copilot have shown potential in generating health care-related information. However, their accuracy, completeness, and safety for providing drug-related information in Thai contexts remain underexplored.
OBJECTIVE: This study aims to evaluate the performance of these artificial intelligence (AI) systems in responding to drug-related questions in Thai.
METHODS: An analytical cross-sectional study was conducted using 76 public drug-related questions compiled from medical databases and social media between November 1, 2019, and December 31, 2024. The questions were grouped into 19 distinct categories, each comprising 4 questions. ChatGPT-4o, Google Gemini, and Microsoft Copilot were queried in a single session on March 1, 2025, using Thai-language input. Clinical pharmacists independently evaluated all responses for correctness, completeness, risk, and reproducibility using standardized evaluation criteria.
RESULTS: All 3 AI models provided generally complete responses (P=.08). ChatGPT-4o yielded the highest proportion of fully correct responses (P=.08). The overall risk levels of the answers did not differ significantly among the models (P=.12). The category of the drug-related questions influenced response correctness (P=.002) but not completeness (P=.23). For pharmacology queries, the correctness of Google Gemini and Microsoft Copilot was higher than that of ChatGPT-4o. The type of question also had a statistically significant effect on the risk level of the answers (P=.04). In particular, the pregnancy and lactation category had the highest high-risk response rate (1/76, 1% per system). All 3 AI models demonstrated consistent response patterns when the same questions were re-queried after 1, 7, and 14 days.
CONCLUSIONS: The evaluated AI chatbots answered the queries with generally complete content; however, their accuracy was limited, and they occasionally produced high-risk errors when responding to drug-related questions in Thai. All models exhibited good reproducibility.