npj Digital Medicine

Assessing the impact of safety guardrails on large language models using irritability metrics.

Bazen Gashaw Teferra, Nabil Johny, Sandra Huang, Alice Rueda, Mohammad Amin Kamaleddin, Katharine Dunlop, Yanbo Zhang, Manish Jha, Divya Sharma, Venkat Bhat

Published: 2026

DOI: 10.1038/s41746-025-02333-3

Abstract

Large language models (LLMs) are increasingly explored for mental health applications, yet their affective realism is shaped by safety guardrails designed to minimize risk. This study examines one affective behaviour, irritability, in LLMs using thre…

