npj Digital Medicine
Assessing the impact of safety guardrails on large language models using irritability metrics.
Bazen Gashaw Teferra, Nabil Johny, Sandra Huang, Alice Rueda, Mohammad Amin Kamaleddin, Katharine Dunlop, Yanbo Zhang, Manish Jha, Divya Sharma, Venkat Bhat
Published: 2026
DOI: 10.1038/s41746-025-02333-3
Abstract
Large language models (LLMs) are increasingly explored for mental health applications, yet their affective realism is shaped by safety guardrails designed to minimize risk. This study examines one affective behaviour, irritability, in LLMs using thre…