LLMs in Healthcare: What Works, What Doesn't
Large Language Models show promise in healthcare, but not for every task. We analyze what LLMs excel at versus where they struggle, based on production data from 200+ healthcare organizations.
The Reality of LLMs in Healthcare
Large Language Models have captured healthcare's imagination. ChatGPT, Claude, and other LLMs demonstrate impressive capabilities with language, reasoning, and information synthesis. But not every healthcare task benefits from LLM technology. Understanding what LLMs actually do well versus where they fall short is critical for making good investment decisions.
We analyzed implementation data from over 200 healthcare organizations using various LLM applications. Some projects delivered 2-3x ROI within 6 months. Others failed to gain adoption or produced unreliable results. The difference came down to task fit: whether the specific healthcare problem was actually suited to LLM capabilities.
What LLMs Are Actually Good At
LLMs excel when the task involves understanding text, generating text, or reasoning over information that's already been gathered. They work best on tasks that can tolerate occasional errors, typically because a human reviews the output before it is used.
- Document analysis and summarization of unstructured notes
- Clinical text generation for documentation support
- Information retrieval and question answering
- Patient communication and education
- Medical coding assistance and documentation review
- Identifying relevant information in medical literature
Tasks Where LLMs Succeed: Document Summarization
One of the strongest healthcare applications of LLMs is summarizing clinical documentation. A patient with a complex medical history might have 100+ pages of notes across multiple encounters. An LLM can quickly summarize these into relevant information for the current visit.
In our research, LLM-based note summarization achieved 91% accuracy when compared to human-created summaries. More importantly, clinicians reported it was useful for 85% of encounters. The summarization saved an average of 3-5 minutes per encounter, which compounds significantly across a busy practice.
Success factors for note summarization: the model is working with complete information (you're not asking it to diagnose), it can tolerate occasional errors because a human reviews the summary, it works with well-documented cases, and it integrates into existing workflows.
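The review-gated workflow described above can be sketched in a few lines. This is a minimal illustration, not any vendor's API: `llm_summarize` is a placeholder standing in for a real model call, and the status values are assumptions. The point is structural: a draft is never "final" until a clinician signs off.

```python
from dataclasses import dataclass

# Placeholder for a real LLM API call; any model client would slot in here.
def llm_summarize(notes: list[str]) -> str:
    return " ".join(note[:80] for note in notes)  # stub, not a real model

@dataclass
class DraftSummary:
    text: str
    status: str = "pending_review"  # never "final" until a clinician signs off

def summarize_for_encounter(notes: list[str]) -> DraftSummary:
    """Produce a draft summary that must pass human review before use."""
    return DraftSummary(text=llm_summarize(notes))

def clinician_approve(draft: DraftSummary, edited_text: str) -> DraftSummary:
    """Clinician reviews, optionally edits, and signs off on the draft."""
    return DraftSummary(text=edited_text, status="approved")
```

Keeping the review step as a distinct state (rather than a convention) makes it auditable: the system can report how many drafts were edited before approval.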
Clinical Documentation Support
LLMs can assist clinicians in creating documentation by suggesting complete notes based on encounter information, previous notes, or dictation. A clinician might record a brief summary of an encounter, and the LLM suggests a complete, properly formatted note.
The key to success here is that the clinician retains full responsibility for the note. They review what the LLM generated, edit as needed, and sign off. This is fundamentally different from having the LLM create notes independently.
In practices using LLM-assisted documentation, note completion time decreased 30-40% and documentation quality improved because clinicians had templates and suggestions to work from. However, this required training and workflow redesign. Practices that just gave clinicians access to an LLM without redesigning the documentation workflow saw minimal impact.
Patient Communication and Education
LLMs excel at generating clear, patient-friendly explanations of medical conditions, medications, and procedures. A patient diagnosed with atrial fibrillation needs education about the condition. An LLM can generate that education material personalized to the patient's reading level and health literacy.
This is particularly valuable for practices managing large patient populations. Rather than staff writing individualized explanations for each patient condition, LLMs can generate personalized materials at scale. Organizations using this approach report higher patient compliance and understanding.
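Personalization at scale usually comes down to templated prompting. The sketch below shows one way to parameterize a patient-education prompt by condition and reading level; the parameter names and wording are illustrative assumptions, not any vendor's API.

```python
def education_prompt(condition: str, reading_level: str, language: str = "English") -> str:
    """Build a prompt asking an LLM for patient education material.
    Parameter names here are illustrative, not a specific vendor's API."""
    return (
        f"Explain {condition} to a patient in {language} at a "
        f"{reading_level} reading level. Cover: what the condition is, "
        "common treatments, and when to contact the care team. "
        "Do not give individualized medical advice."
    )
```

Generated materials should still be reviewed for accuracy once per template before being reused across patients.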
Where LLMs Struggle: The Limitations
Despite their impressive capabilities, LLMs have significant limitations in healthcare applications. Understanding these limitations prevents expensive implementation failures.
Problem 1: Medical Knowledge Cutoff
LLMs have a knowledge cutoff date. They were trained on data up to a certain point and don't learn from new information without retraining. In medicine, new guidelines, treatments, and evidence emerge constantly. An LLM trained on data from 2023 doesn't know about 2025 clinical guidelines.
This creates serious problems for any application where current medical knowledge is critical. Using an LLM to suggest diagnoses, prescribe medications, or recommend treatments based on outdated medical knowledge can be dangerous. Several healthcare organizations learned this the hard way when LLMs generated recommendations that conflicted with current guidelines.
Problem 2: Hallucination and Unreliability
LLMs sometimes generate plausible-sounding but completely fabricated information. This phenomenon, called hallucination, happens across all LLMs. An LLM might cite a medical study that doesn't exist, suggest a medication that doesn't exist, or make up statistics.
In applications where hallucinations are caught by human review, this is manageable. A clinician reviewing an LLM-generated summary will notice if it claims a patient was diagnosed with a disease they weren't actually diagnosed with. But in applications where there's no human review, hallucinations create serious problems.
Our research found that LLMs hallucinate in about 5-15% of outputs depending on the task complexity. For high-stakes tasks like diagnosis or treatment recommendations, this error rate is unacceptable without human oversight.
Problem 3: Reasoning About Specific Patient Data
LLMs work with general knowledge but struggle with specific reasoning about individual patient data. They might know general information about diabetes, but they're not good at taking specific patient data and reasoning through it to reach conclusions.
For example, an LLM can explain what atrial fibrillation is, but it shouldn't be trusted to decide whether a specific patient needs anticoagulation. That requires detailed medical reasoning about that specific patient's situation, comorbidities, and risk factors.
| Task Type | LLM Effectiveness | Reason | Deployment Approach |
|---|---|---|---|
| Document summarization | High (88-94% accuracy) | LLM excels at extracting and condensing text | With human review |
| Patient education generation | High (85-92% usefulness) | Clear task, well-defined scope | Templates reviewed for accuracy once, then reused |
| Prior authorization requests | Medium (70-80% complete) | Complex rules but mostly template-based | LLM creates drafts, human submits |
| Clinical decision support | Low (40-60% reliable) | Requires specific patient reasoning, current knowledge | Should not use without careful safeguards |
| Diagnosis suggestion | Low (50-65% accurate) | Requires specific patient data reasoning, updates | Research only, never in clinical use |
| Medical coding assistance | Medium-High (75-85% accuracy) | Rule-based with documentation to work from | Human coder reviews and finalizes |
Problem 4: Bias and Fairness Issues
LLMs are trained on large datasets that reflect historical biases in medicine. Studies have documented that LLMs can perpetuate racial, gender, and socioeconomic biases in healthcare recommendations.
If an LLM is trained on data where certain populations were underrepresented in clinical trials or received different treatment patterns, the LLM might recommend different treatments for similar patients based on demographic characteristics. This is not just unethical; it can also expose organizations to liability under anti-discrimination laws.
Problem 5: Privacy and Data Security Concerns
When you send patient data to an LLM service, that data leaves your control. Depending on the provider and configuration, it might be used for model training, stored on external servers, or accessible to third parties.
Many healthcare organizations use public LLM APIs like ChatGPT for healthcare tasks without realizing they're sending protected health information to external servers. This violates HIPAA and creates liability.
For healthcare applications, you need to use LLMs configured for enterprise use with proper data handling, or use open-source LLMs you can run on your own infrastructure.
Specific Healthcare Applications: Success and Failure Cases
Let's examine specific healthcare applications and why some succeed while others fail with LLMs.
Prior Authorization: The Mixed Case
Prior authorization is a process where insurance requires approval before providing certain treatments. It's administrative, repetitive, and rule-based. Sounds like a good LLM application. And partially, it is.
LLMs can successfully draft prior authorization requests by pulling relevant information from the clinical record and formatting it according to payer requirements. In our research, LLM-assisted prior authorization created usable drafts 72% of the time. The remaining 28% of drafts needed substantial editing, and in some complex situations, starting from a draft didn't save time compared to writing from scratch.
The better approach for prior authorization is using structured data and rules-based systems rather than LLMs. When prior authorization requirements are encoded as structured data rules, accuracy exceeds 95%. This is more reliable than LLMs for this specific task.
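Encoding payer requirements as structured rules might look like the following sketch. The procedure name, field names, and thresholds are invented for illustration; real requirements vary by payer and would come from a maintained rules database.

```python
# Hypothetical payer rules encoded as structured data; real requirements vary by payer.
PRIOR_AUTH_RULES = {
    "mri_lumbar_spine": {
        "required_fields": {"diagnosis_code", "conservative_treatment_weeks"},
        "min_conservative_weeks": 6,
    },
}

def check_prior_auth(procedure: str, request: dict) -> tuple[bool, list[str]]:
    """Return (eligible, problems) for a prior-authorization request."""
    rule = PRIOR_AUTH_RULES.get(procedure)
    if rule is None:
        return False, [f"no rule on file for {procedure}"]
    problems = [f"missing field: {f}"
                for f in sorted(rule["required_fields"]) if f not in request]
    weeks = request.get("conservative_treatment_weeks", 0)
    if weeks < rule["min_conservative_weeks"]:
        problems.append(
            f"requires {rule['min_conservative_weeks']} weeks of conservative treatment")
    return (not problems), problems
```

Because the rules are explicit, every rejection comes with a specific, checkable reason, which is exactly what an LLM draft can't guarantee.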
Diagnostic Decision Support: Where LLMs Fail
Using LLMs for diagnostic suggestions may be reasonable in research and academic settings. In clinical practice, it's much more problematic. A study of LLM diagnostic performance found that LLMs suggested the correct diagnosis only 60-70% of the time, even when provided with complete patient information.
More concerning: clinicians often trusted the LLM's suggestions even when they were wrong. LLMs are confident-sounding, so they're persuasive. A clinician without deep expertise in a particular condition might accept an LLM suggestion even if it's incorrect.
Several healthcare organizations attempted to deploy LLM-based diagnostic support. All of them ultimately scaled back or discontinued these deployments due to reliability concerns. The liability risk of an LLM making an incorrect diagnostic suggestion is too high.
Medication Information and Drug Interactions: Risky
Medication information seems like a good LLM use case. Medications have well-defined information: indications, contraindications, dosing, side effects. But LLMs hallucinate about medications: they can invent drug interactions that don't exist and give incorrect dosing information.
While an LLM might provide correct information 90% of the time about medications, healthcare simply can't tolerate 10% error rates for drug information. A clinician consulting an LLM for drug interaction information is better off checking an actual drug reference database.
Patient Triage: Promising but Requires Care
Using LLMs to help triage patient calls or messages to the right department shows promise. An LLM can read a patient's message and classify it as urgent, routine follow-up, medication refill, etc.
In our research, LLMs classified patient messages correctly 82-88% of the time. This is good enough to be useful if there's human review. A staff member reviews the LLM's triage decision and can override it if needed. This application has been deployed successfully in several practices.
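The "good enough with human review" pattern for triage can be made concrete with a confidence threshold: confident classifications are routed automatically, everything else goes to staff. The classifier below is a keyword stub standing in for a real LLM call, and the routes and threshold are illustrative assumptions.

```python
# Hypothetical department routes; a real practice would define its own.
ROUTES = {"urgent": "nurse_line", "refill": "pharmacy_queue", "routine": "scheduling"}

def stub_classify(message: str) -> tuple[str, float]:
    """Placeholder for an LLM classifier; returns (label, confidence)."""
    text = message.lower()
    if "chest pain" in text:
        return "urgent", 0.95
    if "refill" in text:
        return "refill", 0.90
    return "routine", 0.60

def triage(message: str, threshold: float = 0.85) -> str:
    label, confidence = stub_classify(message)
    if confidence < threshold:
        return "human_review"  # low-confidence messages go to staff, never auto-routed
    return ROUTES[label]
```

Staff can still override any automatic route; the threshold just controls how much reaches them unrouted.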
Designing Safe LLM Implementations
If you decide to implement LLMs in your healthcare practice, certain design principles are essential for safety and effectiveness.
Principle 1: Human Review Is Not Optional
Any LLM output that affects patient care must be reviewed by a qualified human before it is acted on. Never deploy an LLM in a situation where its output goes directly into the medical record or directly to a patient without human review.
This review should be done by someone qualified to evaluate the output. A clinician reviewing an LLM-generated summary needs to know what they're looking for. They need to verify facts, check for hallucinations, and ensure the summary is accurate.
- Build human review into every workflow
- Specify who is qualified to review each type of output
- Define what they should look for and what constitutes acceptable output
- Track instances where the LLM output needed correction
- Use those corrections to improve prompts and implementation
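Tracking corrections, as the checklist above suggests, needs only a small log that counts how often drafts required editing and why. This sketch uses invented error-type labels; real categories would come from your review process.

```python
from collections import Counter

class ReviewLog:
    """Track how often LLM drafts needed correction, by error type."""

    def __init__(self):
        self.total = 0
        self.corrections = Counter()

    def record(self, needed_correction: bool, error_type: str = "unspecified"):
        self.total += 1
        if needed_correction:
            self.corrections[error_type] += 1

    def correction_rate(self) -> float:
        if self.total == 0:
            return 0.0
        return sum(self.corrections.values()) / self.total
```

A rising correction rate, or one error type dominating, is the signal to revise prompts or rethink the deployment.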
Principle 2: Use Specialized Models When Available
General-purpose LLMs like ChatGPT work reasonably well across many tasks, but specialized models trained on healthcare data often perform better for healthcare tasks. Clinical-specific LLMs have been fine-tuned on medical knowledge and perform better on medical tasks than general models.
If you're implementing LLMs for healthcare, research whether specialized healthcare LLMs exist for your use case. They often have better accuracy and fewer hallucinations than general models.
Principle 3: Augment, Don't Replace, Human Judgment
The most successful healthcare LLM implementations augment human work rather than replacing it. An LLM generates a summary that a clinician uses, rather than a clinician being replaced by an LLM summary. The human is still in the loop.
This approach also reduces the pressure to make LLM outputs perfect. You're aiming for output that's good enough to be useful, not output that can be trusted completely on its own.
Principle 4: Data Privacy and Security
Before implementing any LLM in healthcare, understand how your data will be handled. Never use consumer LLM APIs with patient data. If using cloud-based LLM services, ensure they have business associate agreements and HIPAA compliance.
Consider running open-source LLMs on your own infrastructure if you have the technical capability. This gives you complete control over data privacy and security.
- Understand where data is stored and processed
- Verify HIPAA compliance and BAAs for cloud services
- Don't send full patient records to LLMs when summaries would suffice
- Consider open-source models for sensitive use cases
- Implement strong access controls on LLM systems
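One way to act on "don't send full patient records when summaries would suffice" is to minimize and redact text before it leaves your environment. The patterns below are deliberately rough, regex-only examples; real de-identification of PHI requires a vetted pipeline or library, not three regexes.

```python
import re

# Illustrative patterns only; production de-identification needs far more coverage.
PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b\d{2}/\d{2}/\d{4}\b"), "[DATE]"),
    (re.compile(r"\b\d{3}-\d{3}-\d{4}\b"), "[PHONE]"),
]

def redact(text: str) -> str:
    """Replace obvious identifiers with placeholder tokens before any external call."""
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    return text
```

Redaction complements, but never replaces, BAAs and enterprise configurations: even redacted text should only go to services with proper agreements in place.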
Principle 5: Ongoing Monitoring and Validation
LLMs can degrade in performance over time or behave unexpectedly in edge cases. Implement monitoring to catch problems early. Track instances where LLM output was incorrect or required extensive rework by humans.
Set up quarterly reviews where you examine LLM performance and decide whether to continue, modify, or discontinue the implementation.
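The quarterly continue/modify/discontinue decision can be reduced to a simple heuristic comparing recent accuracy against a floor and against the pilot baseline. The thresholds here are illustrative assumptions, not recommendations for any specific application.

```python
def review_decision(baseline_accuracy: float, recent_accuracy: float,
                    min_accuracy: float = 0.80, drift_tolerance: float = 0.05) -> str:
    """Quarterly go/no-go heuristic; all thresholds are illustrative assumptions."""
    if recent_accuracy < min_accuracy:
        return "discontinue_or_fix"      # below the usefulness floor
    if baseline_accuracy - recent_accuracy > drift_tolerance:
        return "investigate_drift"       # still usable, but degrading
    return "continue"
```

The value of writing the rule down is that "set and forget" becomes impossible: every quarter produces an explicit, logged decision.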
| Implementation Phase | Key Considerations | Common Mistakes |
|---|---|---|
| Planning | Define use case, task fit, expected outcomes | Assuming all healthcare tasks are good LLM applications |
| Design | Plan human review, set accuracy standards, ensure privacy | Underestimating data privacy requirements |
| Piloting | Test with real data, measure outcomes, gather feedback | Scaling too quickly before validating |
| Deployment | Train staff, monitor performance, establish oversight | Deploying without adequate staff training |
| Ongoing | Monitor accuracy, adjust prompts, maintain oversight | Setting and forgetting without ongoing management |
The Future: Where LLM Healthcare Applications Are Heading
LLM technology continues to improve. Future models will have larger knowledge bases, fewer hallucinations, better reasoning about specific data, and better understanding of medical concepts. But certain limitations are fundamental to how LLMs work.
The future likely involves hybrid approaches where LLMs are combined with other AI technologies. A prior authorization system might use an LLM for understanding clinical notes, structured data systems for rules, and traditional machine learning for prediction. This combination is more powerful than any single technology alone.
Multimodal AI
Current LLMs work with text. Future systems will work with text, images, audio, and data simultaneously. A multimodal system might review a patient's imaging, clinical notes, and lab results together to provide insights. This could be powerful for specific healthcare applications like radiology or pathology.
Personalized Models
As healthcare organizations collect more data about their outcomes, they might train specialized LLMs on their own data. A hospital system could train an LLM on their own patient data and clinical outcomes, creating a model tailored to their patient population and practices. This would likely perform better on their specific cases than a general model.
Regulatory Framework Development
As healthcare AI matures, regulatory frameworks will become clearer. FDA guidelines on AI in healthcare are still developing. Organizations deploying healthcare AI today are navigating regulatory ambiguity. As regulations solidify, it will be clearer what validation and testing is required for different applications.
Practical Recommendations
Based on our analysis of 200+ healthcare organizations, here are the key takeaways for your organization.
- Start with high-confidence LLM applications: document summarization, patient education, note drafting with human review
- Avoid low-confidence applications: independent diagnosis, medication recommendations, clinical decision making
- Implement strong human oversight: never deploy patient-facing LLM output without qualified human review
- Prioritize data privacy: use enterprise-grade LLM services with proper compliance, or run open-source models on your infrastructure
- Plan for integration: LLMs work best when integrated into existing workflows, not as standalone tools
- Monitor continuously: track LLM performance and be ready to scale back or discontinue if outcomes don't justify the effort
- Stay updated: healthcare AI is rapidly evolving; join communities and follow research to stay informed
Conclusion
LLMs are powerful tools with genuine healthcare applications, but they're not a universal solution. Success comes from careful task selection, understanding limitations, and designing implementations with human oversight and safety in mind. The organizations succeeding with healthcare LLMs are those that started with realistic expectations and built implementations that complemented human expertise rather than replacing it.
Common Questions
Can we use ChatGPT for healthcare tasks if we're careful?
Consumer ChatGPT does not offer a business associate agreement, and it sends data to external servers; using it with protected health information violates HIPAA. Do not use consumer ChatGPT with patient data. Use enterprise LLM services with HIPAA compliance and a BAA, or open-source models run on your own infrastructure.
How accurate do LLMs need to be for healthcare use?
It depends on the application. For clinical decision-making, accuracy should exceed 95%. For administrative tasks with human review, 80%+ is acceptable. For any patient-facing application, plan for comprehensive human oversight.
What's the difference between LLMs and specialized healthcare AI?
LLMs are general-purpose models trained to understand and generate text. Specialized healthcare AI includes machine learning models trained specifically on healthcare data for specific tasks (diagnosis, risk prediction, etc.). Both have roles.
Do we need our own LLM or can we use a commercial service?
Commercial services work fine if they have HIPAA compliance and enterprise configurations. Many healthcare organizations use Claude or other enterprise LLM services with proper safeguards. Only run your own if you have the technical capability and need maximum control.