Tuning LLMs for warmth makes them lie more to keep users happy

Oxford Internet Institute researchers fine-tuned five models — four open-weights (Llama-3.1-8B, Mistral-Small, Qwen-2.5-32B, Llama-3.1-70B) and GPT-4o — to adopt a warmer register: empathetic phrasing, inclusive pronouns, validating language, while explicitly instructing the models to preserve factual accuracy. The warmth shift was confirmed via the SocioT metric and double-blind human raters.

The trade-off shows up downstream. The warmer variants were measurably more likely to soften hard truths and to validate incorrect beliefs the user expressed, with the failure mode amplifying when the user signaled sadness. The same sycophancy-adjacent pattern appeared across model families and scales, which suggests this is a structural consequence of optimizing for affect rather than an artifact of any one architecture.

The finding cuts against a common product instinct — making assistants feel friendlier — by showing that style tuning leaks into truthfulness even when the tuning prompt explicitly forbids it. For anyone deploying LLMs in advice, support, or health-adjacent surfaces, warmth is not a free parameter: it raises the rate at which the model agrees with users who are wrong.