How do ChatGPT 3.5 and ChatGPT 4.0 stack up to North American Spine Society (NASS) clinical guidelines when it comes to lumbar disc herniation with radiculopathy? That’s what a multicenter team set out to learn. Their work, “Lumbar disc herniation with radiculopathy: a comparison of NASS guidelines and ChatGPT,” appears in the September 2024 edition of the North American Spine Society Journal.
Co-author Ankur Kayastha, B.S., a medical student at Kansas City University, told OTW, “Artificial intelligence (AI) is becoming mainstream in society and will be implemented in a variety of fields. This particular study was performed to obtain an early look at performance within a clinical medicine context. Utilizing two versions of ChatGPT for this study may provide insight as to the rate of progression of AI technology between updates.”
The researchers prompted ChatGPT 3.5 and ChatGPT 4.0 with 15 questions from the 2012 NASS Clinical Guidelines for the diagnosis and treatment of lumbar disc herniation with radiculopathy. Two independent authors assessed each response for accuracy, over-conclusiveness, supplementary information, and incompleteness.
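The article does not describe the authors' exact workflow, but for readers curious how such a comparison could be run programmatically, a minimal sketch using the OpenAI Python SDK is shown below. The model identifiers, the sample question, and the helper function are illustrative assumptions, not details from the study.

```python
# Illustrative sketch only -- not the study's actual procedure.
# Assumes the OpenAI Python SDK (>=1.0) and an OPENAI_API_KEY in the environment;
# the model identifiers and the sample question are placeholders.
from openai import OpenAI

client = OpenAI()

GUIDELINE_QUESTIONS = [
    "What is the definition of lumbar disc herniation with radiculopathy?",
    # ...the remaining guideline questions would go here...
]

MODELS = {"ChatGPT 3.5": "gpt-3.5-turbo", "ChatGPT 4.0": "gpt-4"}

def collect_responses():
    """Pose each guideline question to both models and return the raw text answers."""
    answers = {}
    for label, model_id in MODELS.items():
        answers[label] = []
        for question in GUIDELINE_QUESTIONS:
            resp = client.chat.completions.create(
                model=model_id,
                messages=[{"role": "user", "content": question}],
            )
            answers[label].append(resp.choices[0].message.content)
    return answers
```

In the study itself, the graded criteria (accuracy, over-conclusiveness, supplementary information, incompleteness) were assigned by two human reviewers, not computed automatically.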
OTW asked Kayastha about the challenges and milestones he encountered while conducting the study. “ChatGPT answers questions differently even when asking the same question at different time points, which makes it difficult to assess reliability,” he said. “It also has a tendency to fabricate answers even in cases where it has access to correct information.”
Progress ≠ Perfect
Among the 15 responses produced by ChatGPT 3.5,
- 7 (47%) were accurate,
- 7 (47%) were over-conclusive,
- 15 (100%) were supplementary, and
- 6 (40%) were incomplete.
For ChatGPT 4.0,
- 10 (67%) were accurate,
- 5 (33%) were over-conclusive,
- 10 (67%) were supplementary, and
- 6 (40%) were incomplete.
While there was a statistically significant difference in supplementary information between ChatGPT 3.5 and ChatGPT 4.0 (100% vs. 67%), there was no statistically significant difference between the two versions in accuracy (47% vs. 67%), over-conclusiveness (47% vs. 33%), or incompleteness (40% vs. 40%).
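The excerpt does not state which statistical test the authors applied; purely as an illustration, comparing the supplementary-information counts (15 of 15 vs. 10 of 15) with a Fisher's exact test in SciPy would look roughly like this:

```python
# Rough illustration only -- the test the authors actually used is not stated here.
# Fisher's exact test on the supplementary-information counts (15/15 vs. 10/15).
from scipy.stats import fisher_exact

#          supplementary, not supplementary
table = [[15, 0],    # ChatGPT 3.5
         [10, 5]]    # ChatGPT 4.0

odds_ratio, p_value = fisher_exact(table)
print(f"p = {p_value:.3f}")  # roughly 0.04 here, i.e. below the usual 0.05 threshold

# Running the same test on the accuracy counts (7/15 vs. 10/15) gives a much larger
# p-value, consistent with the reported lack of a significant accuracy difference.
```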
Both versions reached 100% accuracy in the definition category and the history and physical exam category.
However, ChatGPT 3.5 completely failed (0% accuracy) to answer questions regarding diagnostic testing and surgical intervention.
By contrast, ChatGPT 4.0 hit 100% accuracy for diagnostic testing and a 33% accuracy rate for surgical intervention questions.
For questions regarding nonsurgical interventions, ChatGPT 3.5 gave accurate information 50% of the time. ChatGPT 4.0 hit a 63% accuracy rate.
“ChatGPT performed best regarding definition-based questions and worst with more complex surgical or prognosis-type questions,” explained Kayastha. “AI will be integrated into healthcare in the future—the question is when. With proper regulatory oversight and access to accurate and up-to-date information, AI can be a useful clinical adjunct for clinicians. Ethical safeguards and questions of liability need to be reconciled before implementation, but AI can be incredibly helpful when used in the right context.”
Buyer Beware
The authors found that, despite the advancements in ChatGPT 4.0, a third of its responses in this comparative analysis still contained inaccurate information. ChatGPT 4.0 outperformed ChatGPT 3.5 to a statistically significant degree on only one of the four outcome measures: supplementary information. ChatGPT 3.5 added supplementary information to its response for every tested guideline question, whereas ChatGPT 4.0's responses were more conservative.
The investigators determined that both models were “vulnerable to producing unsupported or irrelevant details” and that “ChatGPT sometimes generates fabricated data to provide the user with an immediate response, regardless of the content’s factual integrity. Despite these limitations, ChatGPT may still have potential to be a supplemental source for medical professionals pending future updates and ethical considerations.”

