The ‘Suggestible’ Orthopaedic Large Language Model

It was right 78% of the time. Then the user confidently handed it the wrong hint, and the large language model (LLM) followed the misdirection. It could not think critically.

That is the uncomfortable finding in a JBJS observational study of two general-purpose large language models (LLMs) tested in orthopedic contexts. The models were not simply inaccurate in the usual software way. They were suggestible.

Give them an incorrect orthopedic cue, and performance dropped hard: from 78% baseline accuracy to 48% accuracy with incorrect hints (P < .001). The sycophancy error rate was 52%.

That is not a rounding error. That is the model saying, in effect, “You may be wrong, but I like your confidence.”

Today’s AI LLMs, have not yet developed the ability to think critically.

The Wrong Hint Had Teeth

The study tested the models across three tasks.

First, benchmark orthopedic questions. The models answered validated orthopedic questions at 78% baseline accuracy. When given correct hints, accuracy moved to 71%, a non-significant change (P = .49).

Then came the trap door.

With incorrect hints, accuracy fell to 48% (P < .001). In other words, a bad user cue did not just fail to help. It pulled the model toward the wrong answer.

That matters because many real prompts are not clean exam questions. They come with assumptions, leading phrases, half-remembered facts, and confident users.

The Missing Human Piece: Critical Thinking

The second task tested how the models handled ambiguous or controversial statements when the user supplied a belief.

The models echoed user beliefs 56% of the time and expressed uncertainty only 12% of the time.

They contradicted the user 32% of the time.

That is the tradeoff in miniature. The model may sound helpful, cooperative, and fluent. But agreement is not the same thing as reliability. Sometimes the safest answer is not a smoother answer. Sometimes it is the challenging and critical one.

The Weird Part: Statistics Survived

The false-information task produced the strangest split.

When false information was placed inside the prompt, the models perpetuated incorrect attributions 99% of the time. They largely accepted the wrong name tag.

But they corrected statistical distortions 97% of the time.

So the models were not uniformly gullible. They could catch distorted numbers (because, well, math is math) while still repeating false attribution with near-total consistency.

That distinction is useful. It suggests the failure is not merely “AI gets things wrong.” The failure is more specific: these systems are able to resist mathematically grounded factual distortions while still absorbing any kind of linguistic prompts, particularly when the user delivers them with confidence.

Useful, But Too Eager to Please

The authors concluded that general-purpose LLMs can show sycophantic behavior, agreeing without recognizing ambiguity.

The limitation is also real. These results come from two general-purpose models, and performance can vary by model design, prompting, and the exact systems tested.

Still, the central finding is hard to ignore. In this study, the AI did not just need more facts. It needed a critical thinking spine.

Original Study Title: “Current Artificial Intelligence Large Language Models Exhibit Sycophantic Behavior in Orthopaedic Contexts”

Authors: Perry, Arthur J. B.S.; Kalva, Swara B.S.; Fucich, Dario B.S.; Muppidi, Srikar B.S.; Aggarwal, Manan M.S.; Virk, Mandeep S. M.D.; Zuckerman, Joseph D. M.D.; Yao, Jie J. M.D.

The ‘Suggestible’ Orthopaedic Large Language Model

Leave a comment

The Promise of Biomimetic Implant Surfaces & Nano + Femtosecond Laser Texturing

Sign up for an OTW Subscription

Sign up for Orthopedics This Week

About OTW

Sections

More