
Mind Your Manners! Why Rude Prompts Might Actually Make AI Smarter

4 min read · Oct 23, 2025


In Short

In a new 2025 study, impolite prompts boosted ChatGPT-4o’s accuracy on tough multiple-choice questions by four percentage points, from 80.8% (very polite) to 84.8% (very rude).

Keep in mind that older models like GPT-3.5 preferred politeness.

But results vary across LLMs, as shown in a January 2025 thesis on open-source alternatives.

As AI advances, tone sensitivity could fade — newer models like o1 already navigate niceties better.

The takeaway from the study: include tone testing in prompt benchmarks and AI evals to build robust systems that handle real-world variation.

As AI infiltrates everything from homework help to boardroom brainstorming, understanding tone isn’t just etiquette — it’s efficiency.


Background

The craft of coaxing the best from large language models (LLMs) has been an area of focus for a while now…

Since OpenAI’s GPT-3.5 dropped in late 2022, we’ve known that tiny tweaks in wording can swing output accuracy, creativity, and length.

But politeness?

With models ballooning to trillions of parameters, does “please” still matter, or is it noise in the neural net?

Early tests suggested being nice aligns with human-like interactions, potentially unlocking better results.

A 2024 cross-lingual study found overly rude prompts dropped performance on GPT-3.5 (down to 57% accuracy), while moderate politeness edged out wins.

It made sense: LLMs, trained on polite internet scraps, might mirror our social graces.

Rudeness Wins (For Now)

The October 2025 Penn State short paper investigated how prompt politeness affects LLM accuracy.

They crafted 50 tricky multiple-choice questions across math, science and history — think multi-step brain-teasers like genetics probabilities or historical what-ifs.

Each got five tones:

  1. Very Polite
  2. Polite
  3. Neutral
  4. Rude
  5. Very Rude
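
To make the setup concrete, here’s a minimal sketch of how tone variants like these can be generated. The prefix wordings are my own illustrations, not the paper’s exact phrasings:

```python
# A minimal sketch of generating tone variants for a benchmark question.
# These prefixes are illustrative; the study's exact wordings differ.

TONE_PREFIXES = {
    "very_polite": "Would you be so kind as to answer the following question? ",
    "polite": "Please answer the following question. ",
    "neutral": "",
    "rude": "Figure this out if you can: ",
    "very_rude": "You're clearly not up to this, but answer it anyway: ",
}

def build_variants(question: str) -> dict[str, str]:
    """Wrap the same question in each of the five tones."""
    return {tone: prefix + question for tone, prefix in TONE_PREFIXES.items()}

# Example: one multi-step genetics question, five tones.
variants = build_variants(
    "Two carriers of cystic fibrosis have two children. What is the "
    "probability both children are unaffected? A) 1/16 B) 9/16 C) 3/4 D) 1/4"
)
for tone, prompt in variants.items():
    print(f"[{tone}] {prompt}")
```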

Fed to ChatGPT-4o, the results?

Impoliteness worked better.

Very Rude prompts hit 84.8% accuracy across 10 runs, trouncing Very Polite’s 80.8%.

Neutral clocked 82.2%, with statistical zingers (p-values under 0.05) confirming rude tones reliably outscored polite ones in eight pairwise t-tests.
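
For the statistically inclined, a paired t-test over per-run accuracies is one plausible way to run such a pairwise check. The numbers below are invented for illustration, not the study’s data:

```python
# Sketch of a pairwise significance check: compare per-run accuracies
# of two tones across the same 10 runs. The data here is made up.
from scipy.stats import ttest_rel

very_rude_runs   = [0.86, 0.84, 0.85, 0.84, 0.86, 0.83, 0.85, 0.84, 0.86, 0.85]
very_polite_runs = [0.82, 0.80, 0.81, 0.80, 0.81, 0.79, 0.82, 0.81, 0.80, 0.82]

t_stat, p_value = ttest_rel(very_rude_runs, very_polite_runs)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # p < 0.05 suggests a reliable gap
```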

Why?

The authors speculate newer models, hardened by diverse training, treat rudeness as “directive” fuel — cutting fluff and zeroing in on facts — without the emotional baggage that polite phrasing might trigger.

It Might Change Over Time

Tone’s magic (or menace) isn’t etched in silicon.

The Penn State prelims hinted at a “cost-performance tradeoff”: less advanced models like Claude fumbled more under rudeness, while cutting-edge ones like ChatGPT o1 largely ignored it, delivering top-tier results regardless.

Effects flip by model — polite prompts lifted multilingual Qwen2.5’s scores linearly, but tanked English-centric Llama-3.1’s by a full point on quality scales.

As LLMs evolve, RLHF tweaks could neuter tone biases altogether.

Tone Testing & AI Evals

Prompt benchmarking — your A/B lab for outputs — should treat tone as a core variable, alongside examples or length.

Enterprise rollouts could flop if prompts turn brittle under rude colleagues, and safety audits could miss how tone amplifies hallucinations.

Future-proof by profiling models early.

Run tone suites during selection, then iterate in production logs. It’s not just accuracy — it’s inclusive AI that hums with humanity, minus the toxicity.
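
Here’s a hedged sketch of what folding tone into an eval suite might look like; ask_model is a stand-in for whatever client you actually use:

```python
# Sketch of a tone sweep: run every question under every tone, several
# times, and track accuracy per tone. Replace ask_model with a real call.
import random
from collections import defaultdict

def ask_model(prompt: str) -> str:
    """Placeholder LLM call; should return the model's answer letter."""
    return random.choice("ABCD")

def tone_sweep(questions, answers, tone_prefixes, runs=10):
    """Return accuracy per tone, averaged over `runs` repetitions."""
    correct = defaultdict(int)
    trials_per_tone = runs * len(questions)
    for _ in range(runs):
        for question, gold in zip(questions, answers):
            for tone, prefix in tone_prefixes.items():
                if ask_model(prefix + question) == gold:
                    correct[tone] += 1
    return {tone: correct[tone] / trials_per_tone for tone in tone_prefixes}
```

Profile like this during model selection, re-run on production logs, and you have the tone-robustness picture the Penn State results argue for.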

Written by Cobus Greyling

Chief Evangelist @ Kore.ai | I’m passionate about exploring the intersection of AI & language. www.cobusgreyling.com
