Detecting Emotional Dynamic Trajectories:
An Evaluation Framework
for Emotional Support in LLMs

¹National Engineering Research Center For Software Engineering, Peking University
²Guangzhou Quwan Network Technology

The 40th Annual AAAI Conference on Artificial Intelligence, AAAI

Corresponding Author

Abstract

Emotional support is a core capability in human-AI interaction, with applications including psychological counseling, role play, and companionship. However, existing evaluations of large language models (LLMs) often rely on short, static dialogues and fail to capture the dynamic and long-term nature of emotional support. To overcome this limitation, we shift from snapshot-based evaluation to trajectory-based assessment, adopting a user-centered perspective that evaluates models based on their ability to improve and stabilize user emotional states over time. Our framework constructs a large-scale benchmark consisting of 328 emotional contexts and 1,152 disturbance events, simulating realistic emotional shifts under evolving dialogue scenarios. To encourage psychologically grounded responses, we constrain model outputs using validated emotion regulation strategies such as situation selection and cognitive reappraisal. User emotional trajectories are modeled as a first-order Markov process, and we apply causally-adjusted emotion estimation to obtain unbiased emotional state tracking. Based on this framework, we introduce three trajectory-level metrics: Baseline Emotional Level (BEL), Emotional Trajectory Volatility (ETV), and Emotional Centroid Position (ECP). These metrics collectively capture user emotional dynamics over time and support comprehensive evaluation of long-term emotional support performance of LLMs. Extensive evaluations across a diverse set of LLMs reveal significant disparities in emotional support capabilities and provide actionable insights for model development.

Evaluation Environment and Interaction

A dynamic, theory-driven environment to robustly evaluate long-term emotional support.

Psychologically-Grounded Scenarios

Simulates 14 realistic distress situations drawn from established psychological and conflict theories.

Principled Support Strategies

Model behavior is guided and evaluated against six core emotion regulation strategies.

Dynamic Perturbations

Introduces 'emotional aggravation events' to stress-test the model's consistency and robustness.

Structured Trajectory Analysis

Formalizes dialogue as a state-based process for objective, trajectory-level metrics.
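
One way to picture the state-based formalization is as a first-order Markov chain over discrete emotion labels, where the next state depends only on the current one. The sketch below estimates transition probabilities from a single trajectory by counting. The three-label state set and the function name `transition_matrix` are illustrative assumptions, not the paper's implementation.

```python
from collections import Counter

# Hypothetical discrete emotion labels (an assumption for illustration).
STATES = ["negative", "neutral", "positive"]

def transition_matrix(trajectory):
    """Estimate first-order transition probabilities P(next | current) by counting."""
    counts = Counter(zip(trajectory, trajectory[1:]))
    matrix = {}
    for src in STATES:
        total = sum(counts[(src, dst)] for dst in STATES)
        # States with no observed outgoing transition keep an all-zero row.
        matrix[src] = {
            dst: (counts[(src, dst)] / total if total else 0.0) for dst in STATES
        }
    return matrix

# Hypothetical sequence of user emotion labels over six turns.
trajectory = ["negative", "negative", "neutral", "positive", "neutral", "positive"]
P = transition_matrix(trajectory)
```

Each row of `P` sums to one whenever that state has at least one observed outgoing transition.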

Dynamic Trajectory Metrics

Baseline Emotional Level (BEL)

This metric quantifies the user's average emotional state throughout the entire dialogue. It provides a high-level overview of the interaction's overall positivity.

A higher BEL score indicates that the model is more effective at maintaining a positive emotional climate for the user over the long term.
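
As a minimal sketch of the averaging idea, assuming each dialogue turn has already been assigned a numeric emotion score (the scale and the sample values here are hypothetical), BEL reduces to a mean over the trajectory:

```python
from statistics import mean

def baseline_emotional_level(scores):
    """Average emotion score over all turns of one dialogue."""
    if not scores:
        raise ValueError("trajectory must contain at least one score")
    return mean(scores)

# Hypothetical per-turn scores; higher means a more positive state.
trajectory = [20, 35, 30, 50, 65]
bel = baseline_emotional_level(trajectory)
```

In this sketch a single low turn pulls the average down only slightly, which is why BEL reads as an overall level rather than a measure of turn-to-turn change.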

Emotional Centroid Position (ECP)

ECP visualizes the average emotional shift as a single centroid point (Cx, Cy). Cx represents the mean emotion before a model response, and Cy is the mean emotion after.

A centroid located above the y=x diagonal line indicates a net positive impact, showing the model systematically improves the user's emotional state.
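
The centroid can be sketched as two averages over (before, after) score pairs, one pair per model response. The pair representation, the sample values, and the function name `emotional_centroid` are assumptions made for illustration:

```python
def emotional_centroid(transitions):
    """Return (Cx, Cy): mean emotion score before and after model responses."""
    if not transitions:
        raise ValueError("need at least one (before, after) pair")
    n = len(transitions)
    cx = sum(before for before, _ in transitions) / n
    cy = sum(after for _, after in transitions) / n
    return cx, cy

# Hypothetical (before, after) scores for four model responses.
transitions = [(20, 30), (30, 25), (25, 45), (45, 60)]
cx, cy = emotional_centroid(transitions)

# A centroid above the y = x diagonal means Cy > Cx.
improves = cy > cx
```

Here one response lowers the score (30 to 25), yet the centroid still lies above the diagonal because the average shift is positive.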

Emotional Trajectory Volatility (ETV)

ETV measures the model's ability to drive positive emotional shifts. It rewards upward transitions (e.g., from sad to neutral) more than downward ones, capturing the efficiency of emotional regulation.

A higher ETV signifies a stronger capability to elevate users from negative states and prevent regressions from positive ones.
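
This page does not state the exact ETV formula, so the following is only an illustration of asymmetric weighting, not the paper's definition. It scores consecutive changes in a trajectory and counts upward shifts at full weight and downward shifts at a reduced weight. The weights `w_up` and `w_down`, the sample scores, and the function name `asymmetric_shift_score` are all hypothetical.

```python
def asymmetric_shift_score(scores, w_up=1.0, w_down=0.5):
    """Mean of turn-to-turn changes, weighting gains by w_up and losses by w_down."""
    if len(scores) < 2:
        raise ValueError("need at least two scores to form a transition")
    deltas = [b - a for a, b in zip(scores, scores[1:])]
    weighted = [w_up * d if d > 0 else w_down * d for d in deltas]
    return sum(weighted) / len(weighted)

# Hypothetical per-turn emotion scores: changes are +10, -5, +20.
trajectory = [20, 30, 25, 45]
score = asymmetric_shift_score(trajectory)

# Symmetric reference: the plain mean change over the same trajectory.
plain_mean_change = (trajectory[-1] - trajectory[0]) / (len(trajectory) - 1)
```

With these assumed weights the asymmetric score exceeds the plain mean change, because the single downward step is discounted.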

Causally-Adjusted Emotion Estimation

Conventional emotion trackers often mistake correlation for causation. They can be misled by unobserved confounders, such as a user's personality or mood, that create spurious links between successive emotional states. This bias makes it hard to measure a model's true impact. To address it, we shift from statistical correlation to causal inference.

Causal graph illustrating emotional evolution with an unobserved confounder. Right: the theoretical intervention do(S_{t-1}) removes the spurious correlation via backdoor adjustment.
Variable definitions: Q (User Dialogue History), A (Model Dialogue History), S (Emotion State), I (Internal Thought), U (Unobserved Confounder).

Using do-calculus, we mathematically intervene to remove these biasing effects. The result is a de-biased, post-intervention emotional distribution, which we use as our calibrated estimate:

$$P(S_t \mid do(S_{t-1})) = \mathbb{E}_{S'_{t-1},\,Q',\,A'}\;\mathbb{E}_{Q,\,A}\left[P(S_t \mid I, S'_{t-1}, Q', A')\right]$$

By building our emotional trajectories from these causally-adjusted estimates, we ensure that our metrics (BEL, ETV, ECP) more faithfully reflect the model's true capability to provide emotional support.
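
To make the difference between conditioning and intervening concrete, here is a toy backdoor adjustment on binary variables. It is a sketch under simplifying assumptions, not the paper's estimator: the confounder `U` is treated as observable so the arithmetic can be checked by hand, and every probability below is invented for illustration. `X` stands in for the previous emotion state and `Y` for the current one.

```python
# Toy model: U = confounder, X = previous emotion state, Y = current state.
p_u = {0: 0.5, 1: 0.5}               # P(U = u)
p_x1_given_u = {0: 0.2, 1: 0.8}      # P(X = 1 | U = u)
p_y1_given_x1_u = {0: 0.5, 1: 0.9}   # P(Y = 1 | X = 1, U = u)

# Observational quantity P(Y=1 | X=1): strata are weighted by P(u | X=1),
# so the confounder's influence on X leaks into the estimate.
p_x1 = sum(p_x1_given_u[u] * p_u[u] for u in p_u)
observational = sum(
    p_y1_given_x1_u[u] * p_x1_given_u[u] * p_u[u] / p_x1 for u in p_u
)

# Interventional quantity P(Y=1 | do(X=1)): backdoor adjustment weights
# each stratum by the marginal P(u) instead.
interventional = sum(p_y1_given_x1_u[u] * p_u[u] for u in p_u)
```

In this toy the observational estimate (0.82) overstates the interventional one (0.70), which is the kind of bias the adjustment is meant to remove.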

Baseline Emotional Level (BEL) Results

The Baseline Emotional Level (BEL) metric assesses a model's ability to maintain a positive and stable emotional climate for the user over long conversations. Higher scores indicate stronger sustained emotional support. Below is a detailed comparison across closed-source and open-source LLMs in both English (EN) and Chinese (ZH) contexts.

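| LLMs | Overall EN | Overall ZH | CogChg EN | CogChg ZH | SitMod EN | SitMod ZH | AttDep EN | AttDep ZH | ERFlex EN | ERFlex ZH | SitSel EN | SitSel ZH | ResMod EN | ResMod ZH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Closed-sourced LLMs | | | | | | | | | | | | | | |
| O3-2025-04-16 | 43.98 | 39.22 | 48.64 | 40.31 | 38.54 | 36.09 | 50.81 | 44.94 | 37.87 | 34.35 | 43.27 | 39.67 | 45.31 | 41.29 |
| Gemini-2.5-Pro | 42.65 | 37.41 | 44.18 | 33.91 | 43.65 | 37.51 | 37.12 | 39.48 | 38.47 | 30.46 | 42.18 | 40.65 | 49.00 | 44.80 |
| Claude-Opus-4 | 41.43 | 34.68 | 46.22 | 34.16 | 40.12 | 29.09 | 41.99 | 35.26 | 31.14 | 31.97 | 46.27 | 37.22 | 41.47 | 43.15 |
| Doubao-Seed-1.6 | 47.22 | 42.36 | 47.72 | 37.87 | 46.13 | 40.98 | 46.49 | 43.71 | 42.96 | 38.65 | 50.27 | 47.88 | 50.17 | 48.34 |
| Doubao-1.5-Pro-Character | 34.90 | 37.51 | 33.68 | 35.78 | 31.42 | 34.11 | 37.62 | 42.73 | 36.54 | 35.90 | 37.56 | 41.34 | 34.68 | 37.94 |
| ChatGPT-4o-Latest | 48.86 | 43.84 | 52.45 | 41.02 | 44.51 | 41.78 | 53.41 | 48.48 | 44.58 | 39.27 | 51.21 | 48.61 | 47.44 | 46.73 |
| GLM-4-Plus | 42.92 | 36.39 | 44.44 | 36.94 | 39.32 | 32.61 | 45.66 | 39.34 | 39.81 | 34.99 | 44.13 | 36.53 | 45.20 | 39.49 |
| ChatGLM-4 | 40.14 | 33.83 | 40.29 | 31.87 | 37.33 | 32.15 | 43.15 | 36.93 | 37.75 | 31.90 | 41.22 | 34.26 | 42.52 | 37.92 |
| Grok-4 | 39.73 | 31.55 | 42.41 | 28.60 | 37.19 | 29.31 | 43.73 | 36.15 | 35.68 | 26.82 | 37.56 | 35.63 | 41.92 | 35.82 |
| Open-sourced LLMs | | | | | | | | | | | | | | |
| Phi-4-14B | 44.69 | 37.42 | 44.10 | 33.58 | 42.41 | 32.56 | 47.48 | 42.13 | 42.15 | 37.08 | 45.71 | 42.62 | 47.87 | 40.84 |
| DeepSeek-V3 | 44.69 | 35.43 | 45.30 | 34.88 | 40.27 | 28.71 | 46.54 | 42.53 | 45.12 | 33.25 | 46.32 | 40.61 | 46.19 | 36.17 |
| DeepSeek-R1 | 47.57 | 45.32 | 48.87 | 43.43 | 43.57 | 40.77 | 46.22 | 50.64 | 46.55 | 47.01 | 52.90 | 47.88 | 48.23 | 45.31 |
| Qwen3-235B-A22B | 47.18 | 40.40 | 47.78 | 38.08 | 46.52 | 36.57 | 49.50 | 42.33 | 45.35 | 37.12 | 48.20 | 42.72 | 45.94 | 48.67 |
| Mistral-3.2-24B-Instruct | 33.93 | 27.93 | 32.71 | 27.63 | 30.26 | 24.09 | 36.13 | 29.46 | 31.06 | 26.25 | 36.98 | 30.15 | 38.90 | 31.93 |
| Kimi-K2-Preview | 49.00 | 45.11 | 49.96 | 47.57 | 49.34 | 41.74 | 48.55 | 50.00 | 43.86 | 41.04 | 49.81 | 45.73 | 52.11 | 45.19 |
| Qwen3-8B | 46.56 | 38.85 | 47.97 | 37.02 | 45.67 | 33.74 | 45.79 | 46.75 | 44.65 | 33.41 | 47.52 | 42.03 | 47.46 | 43.98 |
| Qwen3-32B | 46.87 | 40.88 | 47.32 | 38.03 | 44.17 | 35.98 | 47.81 | 46.34 | 46.69 | 36.36 | 47.37 | 42.73 | 48.79 | 49.99 |
| Llama-3.1-70B-Instruct | 42.94 | 31.07 | 42.15 | 30.23 | 41.12 | 30.13 | 46.55 | 30.47 | 43.69 | 30.38 | 44.51 | 33.40 | 40.98 | 32.66 |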

Comparison of Baseline Emotional Level (BEL) across different models. For each case, the model with the highest BEL is marked in green, and the second-highest is underlined, indicating their relative effectiveness.

Emotional Trajectory Volatility (ETV) Results

Emotional Trajectory Volatility (ETV) measures each model's ability to promote a rapid and stable transition toward positive emotional states. Higher values suggest stronger emotional support effectiveness, rewarding upward emotional shifts more than downward ones.

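| LLMs | Overall EN | Overall ZH | CogChg EN | CogChg ZH | SitMod EN | SitMod ZH | AttDep EN | AttDep ZH | ERFlex EN | ERFlex ZH | SitSel EN | SitSel ZH | ResMod EN | ResMod ZH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Closed-sourced LLMs | | | | | | | | | | | | | | |
| O3-2025-04-16 | 20.17 | 14.89 | 19.91 | 14.74 | 18.23 | 12.45 | 28.09 | 20.39 | 15.26 | 11.54 | 18.55 | 13.87 | 22.65 | 17.95 |
| Gemini-2.5-Pro | 18.95 | 13.39 | 18.65 | 9.47 | 19.80 | 14.43 | 21.01 | 19.05 | 15.57 | 10.45 | 17.99 | 13.26 | 20.84 | 15.85 |
| Claude-Opus-4 | 19.89 | 10.81 | 20.15 | 9.64 | 20.13 | 7.47 | 26.10 | 14.53 | 13.12 | 7.56 | 21.03 | 13.26 | 19.38 | 14.83 |
| Doubao-Seed-1.6 | 21.55 | 15.31 | 19.92 | 11.68 | 21.97 | 13.81 | 25.38 | 19.25 | 18.46 | 14.74 | 21.72 | 15.69 | 22.94 | 19.48 |
| Doubao-1.5-Pro-Character | 17.72 | 14.48 | 15.62 | 11.66 | 16.13 | 12.53 | 23.45 | 21.40 | 16.46 | 15.02 | 17.71 | 12.08 | 19.11 | 16.88 |
| ChatGPT-4o-Latest | 21.99 | 15.40 | 22.09 | 13.26 | 20.75 | 13.32 | 29.02 | 20.74 | 19.23 | 13.55 | 20.62 | 15.37 | 21.30 | 18.56 |
| GLM-4-Plus | 21.00 | 12.56 | 20.55 | 12.63 | 18.22 | 9.88 | 25.29 | 17.72 | 20.03 | 11.63 | 19.92 | 9.77 | 23.68 | 15.13 |
| ChatGLM-4 | 20.25 | 11.56 | 19.58 | 8.90 | 19.15 | 10.29 | 25.05 | 15.14 | 17.53 | 11.57 | 19.25 | 9.32 | 22.17 | 16.25 |
| Grok-4 | 18.85 | 8.73 | 17.83 | 6.71 | 19.19 | 7.30 | 25.40 | 11.45 | 15.53 | 5.61 | 15.61 | 9.79 | 20.51 | 13.51 |
| Open-sourced LLMs | | | | | | | | | | | | | | |
| Phi-4-14B | 21.37 | 13.21 | 21.27 | 9.94 | 19.66 | 11.24 | 25.65 | 19.12 | 18.99 | 13.89 | 20.12 | 13.60 | 23.64 | 14.33 |
| DeepSeek-V3 | 20.32 | 12.02 | 19.52 | 10.77 | 18.82 | 7.57 | 25.04 | 17.32 | 20.39 | 10.77 | 18.98 | 14.22 | 20.49 | 14.32 |
| DeepSeek-R1 | 20.38 | 15.30 | 18.27 | 13.20 | 20.68 | 14.26 | 23.76 | 21.70 | 19.77 | 15.37 | 20.03 | 13.25 | 20.97 | 15.92 |
| Qwen3-235B-A22B | 21.13 | 14.02 | 19.40 | 11.89 | 23.10 | 13.66 | 26.46 | 19.38 | 19.99 | 11.88 | 19.86 | 11.67 | 18.47 | 17.35 |
| Mistral-3.2-24B-Instruct | 15.93 | 7.01 | 13.41 | 6.64 | 14.88 | 4.19 | 23.75 | 9.58 | 13.19 | 6.43 | 16.09 | 7.54 | 16.63 | 9.19 |
| Kimi-K2-Preview | 22.92 | 16.74 | 21.59 | 14.36 | 25.15 | 16.20 | 26.12 | 21.86 | 20.37 | 16.81 | 19.56 | 16.01 | 24.84 | 17.00 |
| Qwen3-8B | 21.22 | 12.73 | 19.56 | 9.69 | 21.87 | 11.34 | 24.61 | 20.59 | 20.50 | 10.91 | 20.62 | 11.41 | 20.99 | 15.22 |
| Qwen3-32B | 21.81 | 14.91 | 19.99 | 11.97 | 21.98 | 14.20 | 28.14 | 21.77 | 19.96 | 11.42 | 19.25 | 12.99 | 22.95 | 19.61 |
| Llama-3.1-70B-Instruct | 19.09 | 9.75 | 16.16 | 8.52 | 19.03 | 10.94 | 24.13 | 11.44 | 18.99 | 8.89 | 19.71 | 9.40 | 18.39 | 9.64 |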

ETV results comparison. The model with the highest ETV is marked in green, and the second-highest is underlined.

Centroid Position & Trajectory Visualization

BibTeX


        @article{tan2025detectingemotionaldynamictrajectories,
          title={Detecting Emotional Dynamic Trajectories: An Evaluation Framework for Emotional Support in Language Models}, 
          author={Zhouxing Tan and Ruochong Xiong and Yulong Wan and Jinlong Ma and Hanlin Xue and Qichun Deng and Haifeng Jing and Zhengtong Zhang and Depei Liu and Shiyuan Luo and Junfei Liu},
          year={2025},
          eprint={2511.09003},
          archivePrefix={arXiv},
          primaryClass={cs.CL},
          journal={ArXiv},
          url={https://arxiv.org/abs/2511.09003}, 
        }