LLMs in Healthcare: What Works, What Doesn't
Large Language Models show promise in healthcare, but not for every task. We analyze what LLMs excel at versus where they struggle, based on production data from 200+ healthcare organizations.
The Reality of LLMs in Healthcare
Large Language Models have captured healthcare's imagination. ChatGPT, Claude, and other LLMs demonstrate impressive capabilities with language, reasoning, and information synthesis. But not every healthcare task benefits from LLM technology. Understanding what LLMs actually do well versus where they fall short is critical for making good investment decisions.
We analyzed implementation data from over 200 healthcare organizations using various LLM applications. Some projects delivered 2-3x ROI within 6 months. Others failed to gain adoption or produced unreliable results. The difference came down to task fit: whether the specific healthcare problem was actually suited to LLM capabilities.
What LLMs Are Actually Good At
LLMs excel when the task involves understanding text, generating text, or reasoning over information that's already been gathered. They work best on tasks that can tolerate occasional errors, typically because a human reviews the output before it is used.
- Document analysis and summarization of unstructured notes
- Clinical text generation for documentation support
- Information retrieval and question answering
- Patient communication and education
- Medical coding assistance and documentation review
- Identifying relevant information in medical literature
Tasks Where LLMs Succeed: Document Summarization
One of the strongest healthcare applications of LLMs is summarizing clinical documentation. A patient with a complex medical history might have 100+ pages of notes across multiple encounters. An LLM can quickly summarize these into relevant information for the current visit.
In our research, LLM-based note summarization achieved 91% accuracy when compared to human-created summaries. More importantly, clinicians reported it was useful for 85% of encounters. The summarization saved an average of 3-5 minutes per encounter, which compounds significantly across a busy practice.
Success factors for note summarization: the model is working with complete information (you're not asking it to diagnose), it can tolerate occasional errors because a human reviews the summary, it works with well-documented cases, and it integrates into existing workflows.
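The review-gated workflow described above can be sketched in a few lines. This is a minimal illustration, not any vendor's API: `llm_summarize` is a placeholder standing in for a real model call, and the status values are assumptions. The point is structural: a draft is never "final" until a clinician signs off.

```python
from dataclasses import dataclass

# Placeholder for a real LLM API call; any model client would slot in here.
def llm_summarize(notes: list[str]) -> str:
    return " ".join(note[:80] for note in notes)  # stub, not a real model

@dataclass
class DraftSummary:
    text: str
    status: str = "pending_review"  # never "final" until a clinician signs off

def summarize_for_encounter(notes: list[str]) -> DraftSummary:
    """Produce a draft summary that must pass human review before use."""
    return DraftSummary(text=llm_summarize(notes))

def clinician_approve(draft: DraftSummary, edited_text: str) -> DraftSummary:
    """Clinician reviews, optionally edits, and signs off on the draft."""
    return DraftSummary(text=edited_text, status="approved")
```

Keeping the review step as a distinct state (rather than a convention) makes it auditable: the system can report how many drafts were edited before approval.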
Clinical Documentation Support
LLMs can assist clinicians in creating documentation by suggesting complete notes based on encounter information, previous notes, or dictation. A clinician might record a brief summary of an encounter, and the LLM suggests a complete, properly formatted note.
The key to success here is that the clinician retains full responsibility for the note. They review what the LLM generated, edit as needed, and sign off. This is fundamentally different from having the LLM create notes independently.
In practices using LLM-assisted documentation, note completion time decreased 30-40% and documentation quality improved because clinicians had templates and suggestions to work from. However, this required training and workflow redesign. Practices that just gave clinicians access to an LLM without redesigning the documentation workflow saw minimal impact.
Patient Communication and Education
LLMs excel at generating clear, patient-friendly explanations of medical conditions, medications, and procedures. A patient diagnosed with atrial fibrillation needs education about the condition. An LLM can generate that education material personalized to the patient's reading level and health literacy.
This is particularly valuable for practices managing large patient populations. Rather than staff writing individualized explanations for each patient condition, LLMs can generate personalized materials at scale. Organizations using this approach report higher patient compliance and understanding.
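Personalization at scale usually comes down to templated prompting. The sketch below shows one way to parameterize a patient-education prompt by condition and reading level; the parameter names and wording are illustrative assumptions, not any vendor's API.

```python
def education_prompt(condition: str, reading_level: str, language: str = "English") -> str:
    """Build a prompt asking an LLM for patient education material.
    Parameter names here are illustrative, not a specific vendor's API."""
    return (
        f"Explain {condition} to a patient in {language} at a "
        f"{reading_level} reading level. Cover: what the condition is, "
        "common treatments, and when to contact the care team. "
        "Do not give individualized medical advice."
    )
```

Generated materials should still be reviewed for accuracy once per template before being reused across patients.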
Where LLMs Struggle: The Limitations
Despite their impressive capabilities, LLMs have significant limitations in healthcare applications. Understanding these limitations prevents expensive implementation failures.
Problem 1: Medical Knowledge Cutoff
LLMs have a knowledge cutoff date. They were trained on data up to a certain point and don't learn from new information without retraining. In medicine, new guidelines, treatments, and evidence emerge constantly. An LLM trained on data from 2023 doesn't know about 2025 clinical guidelines.
This creates serious problems for any application where current medical knowledge is critical. Using an LLM to suggest diagnoses, prescribe medications, or recommend treatments based on outdated medical knowledge can be dangerous. Several healthcare organizations learned this the hard way when LLMs generated recommendations that conflicted with current guidelines.
Problem 2: Hallucination and Unreliability
LLMs sometimes generate plausible-sounding but completely fabricated information. This phenomenon, called hallucination, happens across all LLMs. An LLM might cite a medical study that doesn't exist, suggest a medication that doesn't exist, or make up statistics.
In applications where hallucinations are caught by human review, this is manageable. A clinician reviewing an LLM-generated summary will notice if it claims a patient was diagnosed with a disease they weren't actually diagnosed with. But in applications where there's no human review, hallucinations create serious problems.
Our research found that LLMs hallucinate in about 5-15% of outputs depending on the task complexity. For high-stakes tasks like diagnosis or treatment recommendations, this error rate is unacceptable without human oversight.
Problem 3: Reasoning About Specific Patient Data
LLMs work with general knowledge but struggle with specific reasoning about individual patient data. They might know general information about diabetes, but they're not good at taking specific patient data and reasoning through it to reach conclusions.
For example, an LLM can explain what atrial fibrillation is, but it shouldn't be trusted to decide whether a specific patient needs anticoagulation. That requires detailed medical reasoning about that specific patient's situation, comorbidities, and risk factors.
| Task Type | LLM Effectiveness | Reason | Deployment Approach |
|---|---|---|---|
| Document summarization | High (88-94% accuracy) | LLM excels at extracting and condensing text | With human review |
| Patient education generation | High (85-92% usefulness) | Clear task, well-defined scope | Templates reviewed for accuracy once, then reused |
| Prior authorization requests | Medium (70-80% complete) | Complex rules but mostly template-based | LLM creates drafts, human submits |
| Clinical decision support | Low (40-60% reliable) | Requires specific patient reasoning, current knowledge | Should not use without careful safeguards |
| Diagnosis suggestion | Low (50-65% accurate) | Requires specific patient data reasoning, updates | Research only, never in clinical use |
| Medical coding assistance | Medium-High (75-85% accuracy) | Rule-based with documentation to work from | Human coder reviews and finalizes |
Problem 4: Bias and Fairness Issues
LLMs are trained on large datasets that reflect historical biases in medicine. Studies have documented that LLMs can perpetuate racial, gender, and socioeconomic biases in healthcare recommendations.
If an LLM is trained on data where certain populations were underrepresented in clinical trials or received different treatment patterns, the LLM might recommend different treatments for similar patients based on demographic characteristics. This is not just unethical; it can also expose organizations to liability under anti-discrimination laws.
Problem 5: Privacy and Data Security Concerns
When you send patient data to an LLM service, that data leaves your control. Depending on the provider and configuration, it might be used for model training, stored on external servers, or accessible to third parties.
Many healthcare organizations use public LLM APIs like ChatGPT for healthcare tasks without realizing they're sending protected health information to external servers. This violates HIPAA and creates liability.
For healthcare applications, you need to use LLMs configured for enterprise use with proper data handling, or use open-source LLMs you can run on your own infrastructure.
Specific Healthcare Applications: Success and Failure Cases
Let's examine specific healthcare applications and why some succeed while others fail with LLMs.
Prior Authorization: The Mixed Case
Prior authorization is a process where insurance requires approval before providing certain treatments. It's administrative, repetitive, and rule-based. Sounds like a good LLM application. And partially, it is.
LLMs can successfully draft prior authorization requests by pulling relevant information from the clinical record and formatting it according to payer requirements. In our research, LLM-assisted prior authorization created usable drafts 72% of the time. The remaining 28% of drafts needed substantial editing, and in some complex situations, starting from a draft didn't save time compared to writing from scratch.
The better approach for prior authorization is using structured data and rules-based systems rather than LLMs. When prior authorization requirements are encoded as structured data rules, accuracy exceeds 95%. This is more reliable than LLMs for this specific task.
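Encoding payer requirements as structured rules might look like the following sketch. The procedure name, field names, and thresholds are invented for illustration; real requirements vary by payer and would come from a maintained rules database.

```python
# Hypothetical payer rules encoded as structured data; real requirements vary by payer.
PRIOR_AUTH_RULES = {
    "mri_lumbar_spine": {
        "required_fields": {"diagnosis_code", "conservative_treatment_weeks"},
        "min_conservative_weeks": 6,
    },
}

def check_prior_auth(procedure: str, request: dict) -> tuple[bool, list[str]]:
    """Return (eligible, problems) for a prior-authorization request."""
    rule = PRIOR_AUTH_RULES.get(procedure)
    if rule is None:
        return False, [f"no rule on file for {procedure}"]
    problems = [f"missing field: {f}"
                for f in sorted(rule["required_fields"]) if f not in request]
    weeks = request.get("conservative_treatment_weeks", 0)
    if weeks < rule["min_conservative_weeks"]:
        problems.append(
            f"requires {rule['min_conservative_weeks']} weeks of conservative treatment")
    return (not problems), problems
```

Because the rules are explicit, every rejection comes with a specific, checkable reason, which is exactly what an LLM draft can't guarantee.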
Diagnostic Decision Support: Where LLMs Fail
Using LLMs for diagnostic suggestions may be reasonable in research and academic settings. In clinical practice, it's much more problematic. A study of LLM diagnostic performance found that LLMs suggested the correct diagnosis only 60-70% of the time, even when provided with complete patient information.
More concerning: clinicians often trusted the LLM's suggestions even when they were wrong. LLMs are confident-sounding, so they're persuasive. A clinician without deep expertise in a particular condition might accept an LLM suggestion even if it's incorrect.
Several healthcare organizations attempted to deploy LLM-based diagnostic support. All of them ultimately scaled back or discontinued these deployments due to reliability concerns. The liability risk of an LLM making an incorrect diagnostic suggestion is too high.
Medication Information and Drug Interactions: Risky
Medication information seems like a good LLM use case. Medications have well-defined information: indications, contraindications, dosing, side effects. But LLMs hallucinate about medications: they can invent drug interactions that don't exist and give incorrect dosing information.
While an LLM might provide correct information 90% of the time about medications, healthcare simply can't tolerate 10% error rates for drug information. A clinician consulting an LLM for drug interaction information is better off checking an actual drug reference database.
Patient Triage: Promising but Requires Care
Using LLMs to help triage patient calls or messages to the right department shows promise. An LLM can read a patient's message and classify it as urgent, routine follow-up, medication refill, etc.
In our research, LLMs classified patient messages correctly 82-88% of the time. This is good enough to be useful if there's human review. A staff member reviews the LLM's triage decision and can override it if needed. This application has been deployed successfully in several practices.
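The "good enough with human review" pattern for triage can be made concrete with a confidence threshold: confident classifications are routed automatically, everything else goes to staff. The classifier below is a keyword stub standing in for a real LLM call, and the routes and threshold are illustrative assumptions.

```python
# Hypothetical department routes; a real practice would define its own.
ROUTES = {"urgent": "nurse_line", "refill": "pharmacy_queue", "routine": "scheduling"}

def stub_classify(message: str) -> tuple[str, float]:
    """Placeholder for an LLM classifier; returns (label, confidence)."""
    text = message.lower()
    if "chest pain" in text:
        return "urgent", 0.95
    if "refill" in text:
        return "refill", 0.90
    return "routine", 0.60

def triage(message: str, threshold: float = 0.85) -> str:
    label, confidence = stub_classify(message)
    if confidence < threshold:
        return "human_review"  # low-confidence messages go to staff, never auto-routed
    return ROUTES[label]
```

Staff can still override any automatic route; the threshold just controls how much reaches them unrouted.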
Designing Safe LLM Implementations
If you decide to implement LLMs in your healthcare practice, certain design principles are essential for safety and effectiveness.
Principle 1: Human Review Is Not Optional
Any LLM output that affects patient care must be reviewed by a qualified human before it is acted on. Never deploy an LLM in a situation where its output goes directly into the medical record or directly to a patient without human review.
This review should be done by someone qualified to evaluate the output. A clinician reviewing an LLM-generated summary needs to know what they're looking for. They need to verify facts, check for hallucinations, and ensure the summary is accurate.
- Build human review into every workflow
- Specify who is qualified to review each type of output
- Define what they should look for and what constitutes acceptable output
- Track instances where the LLM output needed correction
- Use those corrections to improve prompts and implementation
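Tracking corrections, as the checklist above suggests, needs only a small log that counts how often drafts required editing and why. This sketch uses invented error-type labels; real categories would come from your review process.

```python
from collections import Counter

class ReviewLog:
    """Track how often LLM drafts needed correction, by error type."""

    def __init__(self):
        self.total = 0
        self.corrections = Counter()

    def record(self, needed_correction: bool, error_type: str = "unspecified"):
        self.total += 1
        if needed_correction:
            self.corrections[error_type] += 1

    def correction_rate(self) -> float:
        if self.total == 0:
            return 0.0
        return sum(self.corrections.values()) / self.total
```

A rising correction rate, or one error type dominating, is the signal to revise prompts or rethink the deployment.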
Principle 2: Use Specialized Models When Available
General-purpose LLMs like ChatGPT work reasonably well across many tasks, but specialized models trained on healthcare data often perform better for healthcare tasks. Clinical-specific LLMs have been fine-tuned on medical knowledge and perform better on medical tasks than general models.
If you're implementing LLMs for healthcare, research whether specialized healthcare LLMs exist for your use case. They often have better accuracy and fewer hallucinations than general models.
Principle 3: Augment, Don't Replace, Human Judgment
The most successful healthcare LLM implementations augment human work rather than replacing it. An LLM generates a summary that a clinician uses, rather than a clinician being replaced by an LLM summary. The human is still in the loop.
This approach also reduces the pressure to make LLM outputs perfect. You're aiming for output that's good enough to be useful, not output that can be trusted completely on its own.
Principle 4: Data Privacy and Security
Before implementing any LLM in healthcare, understand how your data will be handled. Never use consumer LLM APIs with patient data. If using cloud-based LLM services, ensure they have business associate agreements and HIPAA compliance.
Consider running open-source LLMs on your own infrastructure if you have the technical capability. This gives you complete control over data privacy and security.
- Understand where data is stored and processed
- Verify HIPAA compliance and BAAs for cloud services
- Don't send full patient records to LLMs when summaries would suffice
- Consider open-source models for sensitive use cases
- Implement strong access controls on LLM systems
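One way to act on "don't send full patient records when summaries would suffice" is to minimize and redact text before it leaves your environment. The patterns below are deliberately rough, regex-only examples; real de-identification of PHI requires a vetted pipeline or library, not three regexes.

```python
import re

# Illustrative patterns only; production de-identification needs far more coverage.
PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b\d{2}/\d{2}/\d{4}\b"), "[DATE]"),
    (re.compile(r"\b\d{3}-\d{3}-\d{4}\b"), "[PHONE]"),
]

def redact(text: str) -> str:
    """Replace obvious identifiers with placeholder tokens before any external call."""
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    return text
```

Redaction complements, but never replaces, BAAs and enterprise configurations: even redacted text should only go to services with proper agreements in place.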
Principle 5: Ongoing Monitoring and Validation
LLMs can degrade in performance over time or behave unexpectedly in edge cases. Implement monitoring to catch problems early. Track instances where LLM output was incorrect or required extensive rework by humans.
Set up quarterly reviews where you examine LLM performance and decide whether to continue, modify, or discontinue the implementation.
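The quarterly continue/modify/discontinue decision can be reduced to a simple heuristic comparing recent accuracy against a floor and against the pilot baseline. The thresholds here are illustrative assumptions, not recommendations for any specific application.

```python
def review_decision(baseline_accuracy: float, recent_accuracy: float,
                    min_accuracy: float = 0.80, drift_tolerance: float = 0.05) -> str:
    """Quarterly go/no-go heuristic; all thresholds are illustrative assumptions."""
    if recent_accuracy < min_accuracy:
        return "discontinue_or_fix"      # below the usefulness floor
    if baseline_accuracy - recent_accuracy > drift_tolerance:
        return "investigate_drift"       # still usable, but degrading
    return "continue"
```

The value of writing the rule down is that "set and forget" becomes impossible: every quarter produces an explicit, logged decision.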
| Implementation Phase | Key Considerations | Common Mistakes |
|---|---|---|
| Planning | Define use case, task fit, expected outcomes | Assuming all healthcare tasks are good LLM applications |
| Design | Plan human review, set accuracy standards, ensure privacy | Underestimating data privacy requirements |
| Piloting | Test with real data, measure outcomes, gather feedback | Scaling too quickly before validating |
| Deployment | Train staff, monitor performance, establish oversight | Deploying without adequate staff training |
| Ongoing | Monitor accuracy, adjust prompts, maintain oversight | Setting and forgetting without ongoing management |
The Future: Where LLM Healthcare Applications Are Heading
LLM technology continues to improve. Future models will have larger knowledge bases, fewer hallucinations, better reasoning about specific data, and better understanding of medical concepts. But certain limitations are fundamental to how LLMs work.
The future likely involves hybrid approaches where LLMs are combined with other AI technologies. A prior authorization system might use an LLM for understanding clinical notes, structured data systems for rules, and traditional machine learning for prediction. This combination is more powerful than any single technology alone.
Multimodal AI
Current LLMs work with text. Future systems will work with text, images, audio, and data simultaneously. A multimodal system might review a patient's imaging, clinical notes, and lab results together to provide insights. This could be powerful for specific healthcare applications like radiology or pathology.
Personalized Models
As healthcare organizations collect more data about their outcomes, they might train specialized LLMs on their own data. A hospital system could train an LLM on their own patient data and clinical outcomes, creating a model tailored to their patient population and practices. This would likely perform better on their specific cases than a general model.
Regulatory Framework Development
As healthcare AI matures, regulatory frameworks will become clearer. FDA guidelines on AI in healthcare are still developing. Organizations deploying healthcare AI today are navigating regulatory ambiguity. As regulations solidify, it will be clearer what validation and testing is required for different applications.
Practical Recommendations
Based on our analysis of 200+ healthcare organizations, here are the key takeaways for your organization.
- Start with high-confidence LLM applications: document summarization, patient education, note drafting with human review
- Avoid low-confidence applications: independent diagnosis, medication recommendations, clinical decision making
- Implement strong human oversight: never deploy patient-facing LLM output without qualified human review
- Prioritize data privacy: use enterprise-grade LLM services with proper compliance, or run open-source models on your infrastructure
- Plan for integration: LLMs work best when integrated into existing workflows, not as standalone tools
- Monitor continuously: track LLM performance and be ready to scale back or discontinue if outcomes don't justify the effort
- Stay updated: healthcare AI is rapidly evolving; join communities and follow research to stay informed
Conclusion
LLMs are powerful tools with genuine healthcare applications, but they're not a universal solution. Success comes from careful task selection, understanding limitations, and designing implementations with human oversight and safety in mind. The organizations succeeding with healthcare LLMs are those that started with realistic expectations and built implementations that complemented human expertise rather than replacing it.
Common Questions
Can we use ChatGPT for healthcare tasks if we're careful?
Consumer ChatGPT does not offer a business associate agreement, and it sends data to external servers; using it with protected health information violates HIPAA. Do not use consumer ChatGPT with patient data. Use enterprise LLM services with HIPAA compliance and a BAA, or open-source models run on your own infrastructure.
How accurate do LLMs need to be for healthcare use?
It depends on the application. For clinical decision-making, accuracy should exceed 95%. For administrative tasks with human review, 80%+ is acceptable. For any patient-facing application, plan for comprehensive human oversight.
What's the difference between LLMs and specialized healthcare AI?
LLMs are general-purpose models trained to understand and generate text. Specialized healthcare AI includes machine learning models trained specifically on healthcare data for specific tasks (diagnosis, risk prediction, etc.). Both have roles.
Do we need our own LLM or can we use a commercial service?
Commercial services work fine if they have HIPAA compliance and enterprise configurations. Many healthcare organizations use Claude or other enterprise LLM services with proper safeguards. Only run your own if you have the technical capability and need maximum control.