TY - JOUR
T1 - DepressionLLM
T2 - Emotion- and causality-aware depression detection with foundation models
AU - Teng, Shiyu
AU - Liu, Jiaqing
AU - Sun, Hao
AU - Huang, Yue
AU - Jain, Rahul Kumar
AU - Chai, Shurong
AU - Hou, Ruibo
AU - Tateyama, Tomoko
AU - Lin, Lanfen
AU - He, Lang
AU - Chen, Yen-Wei
N1 - Publisher Copyright:
© 2025 Elsevier B.V.
PY - 2026/4
Y1 - 2026/4
N2 - Depression is a complex mental health issue often reflected through subtle multimodal signals in speech, facial expressions, and language. However, existing approaches using large language models (LLMs) face limitations in integrating these diverse modalities and providing interpretable insights, restricting their effectiveness in real-world and clinical settings. This study presents a novel framework that leverages foundation models for interpretable multimodal depression detection. Our approach follows a three-stage process: First, pseudo-labels enriched with emotional and causal cues are generated using a pretrained language model (GPT-4o), expanding the training signal beyond ground-truth labels. Second, a coarse-grained learning phase employs another model (Qwen2.5) to capture relationships among depression levels, emotional states, and inferred reasoning. Finally, a fine-grained tuning stage fuses video, audio, and text inputs via a multimodal prompt fusion module to construct a unified depression representation. We evaluate our framework on benchmark datasets – E-DAIC, CMDC, and EATD – demonstrating consistent improvements over state-of-the-art methods in both depression detection and causal reasoning tasks. By integrating foundation models with multimodal video understanding, our work offers a robust and interpretable solution for mental health analysis, contributing to the advancement of multimodal AI in clinical and real-world applications.
KW - Depression detection
KW - Large language model
KW - Multimodal learning
UR - https://www.scopus.com/pages/publications/105023952816
UR - https://www.scopus.com/pages/publications/105023952816#tab=citedBy
U2 - 10.1016/j.displa.2025.103304
DO - 10.1016/j.displa.2025.103304
M3 - Article
AN - SCOPUS:105023952816
SN - 0141-9382
VL - 92
JO - Displays
JF - Displays
M1 - 103304
ER -