Building Auditable LLM Workflows for Medical Coding

May 26, 2026
Medical coding is one of healthcare’s most persistent engineering bottlenecks. Converting a messy, jargon-filled clinical narrative into a precise set of alphanumeric codes—selected from a search space of over 70,000 possibilities in the US ICD-10-CM system—is an "extreme multi-label" classification problem.
Early attempts to solve this with Large Language Models (LLMs) fell into a predictable trap: treating the model as an open-ended code generator. Asking an LLM to read a 10-page discharge summary and output the correct codes tends to produce hallucinated codes, missed comorbidities, and outputs that cannot be audited. A single hallucinated digit or unsupported code mapping directly affects compliance and billing accuracy.
To build a reliable system for high-stakes healthcare environments, we combine the cognitive flexibility of LLMs with a deterministic, state-driven workflow engine: the model extracts and verifies evidence, while the runtime controls validation, routing, and human review.

Extract, Verify, Reconcile

Monolithic prompts fail because they force a model to perform multiple complex cognitive tasks simultaneously: parsing noisy text, recalling taxonomy rules, and formatting data. Workflow decomposition breaks this black box into discrete, predictable stages managed by an explicit software pipeline.
  • Extract: Clinical notes suffer from "note bloat"—redundant histories, administrative templates, and conversational noise. The pipeline begins by isolating the clinical signal. An initial extraction pass scans the raw note solely to pull out explicit diagnostic phrases, active symptoms, and procedures, reducing the surrounding text noise while preserving the local evidence span.
  • Verify: Rather than evaluating the entire taxonomy, the system uses a localized retriever (such as a sparse or dense vector index) to map the extracted phrases to a compact shortlist of candidate codes. The LLM is then introduced purely to evaluate this shortlist against the isolated clinical phrases.
  • Reconcile: The final logical check resolves clinical context. The pipeline analyzes the condition's status—determining whether a diagnosis is active, historical, or explicitly negated (e.g., "patient's chest pain was ruled out").
The boundary is important: the runtime, not the model, decides which codes are even eligible. This keeps the LLM out of the role it is weakest at (unbounded taxonomy search) and moves it into the role where it is more useful: evidence-grounded verification inside a controlled workflow.
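As a rough sketch of that boundary, the runtime can own the control flow and hand the model only narrow tasks. Everything below (`run_pipeline`, the stage signatures, the toy stand-ins) is hypothetical naming chosen for illustration; no model or index is actually called.

```python
from typing import Callable

# Hypothetical stage signatures. In a real system `extract` and `verify`
# would wrap LLM calls and `retrieve` would query a codebook index.
Extract = Callable[[str], list[str]]
Retrieve = Callable[[str], list[str]]
Verify = Callable[[str, list[str]], list[str]]


def run_pipeline(note: str, extract: Extract, retrieve: Retrieve, verify: Verify) -> dict[str, list[str]]:
    """Runtime-owned control flow: the model never searches the full taxonomy."""
    results: dict[str, list[str]] = {}
    for phrase in extract(note):
        shortlist = retrieve(phrase)          # deterministic candidate search
        accepted = verify(phrase, shortlist)  # model judges the shortlist only
        # Guard: drop anything the verifier returns that is not on the shortlist.
        results[phrase] = [code for code in accepted if code in shortlist]
    return results


# Toy stand-ins so the sketch runs without a model or an index.
def toy_extract(note: str) -> list[str]:
    return [p for p in ("acute systolic heart failure",) if p in note]


def toy_retrieve(phrase: str) -> list[str]:
    return ["I50.21", "I50.23"]


def toy_verify(phrase: str, shortlist: list[str]) -> list[str]:
    return ["I50.21", "X00.00"]  # second value simulates a hallucinated code


demo = run_pipeline(
    "Pt admitted with acute systolic heart failure.",
    toy_extract, toy_retrieve, toy_verify,
)
```

The guard inside the loop is the point of the sketch: even if the verifier misbehaves, a code that the retriever never proposed cannot reach the output.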

Implement the Extraction Gatekeeper

As noted above, clinical notes are full of redundant templates, copy-pasted history, and administrative boilerplate. If you pass an uncleaned note directly to an embedding model or a vector search engine, this noise dilutes the semantic weight of the active diagnoses and degrades retrieval recall.
Therefore, the pipeline must begin with a dedicated Extraction Layer. This layer has no knowledge of ICD-10 codes or billing guidelines. Its sole mandate is Clinical Named Entity Recognition: isolating clean, granular diagnostic signals and their localized textual anchors from the raw text. Keeping the extraction layer taxonomy-agnostic means the next stage (retrieval) receives clean, high-signal search queries that are decoupled from coding and formatting logic.
Below is how we model and execute this isolated extraction step using Pydantic to enforce schema-shaped text extraction:
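A minimal sketch of such a schema follows. To keep it dependency-free it uses stdlib dataclasses as a stand-in for a Pydantic model; the validation idea is the same. The field names (`phrase`, `evidence_span`, `certainty`) and the allowed certainty values are assumptions, and the "model reply" is a hand-written string rather than a real LLM call.

```python
import json
from dataclasses import dataclass

# Assumed label set; adjust to your own extraction policy.
ALLOWED_CERTAINTY = {"confirmed", "suspected", "negated", "historical"}


@dataclass(frozen=True)
class ExtractedClinicalSignal:
    """One taxonomy-agnostic finding; deliberately carries no ICD-10 fields."""
    phrase: str         # normalized clinical concept, e.g. "hypertension"
    evidence_span: str  # verbatim text copied from the note
    certainty: str      # one of ALLOWED_CERTAINTY

    def __post_init__(self) -> None:
        if not self.phrase.strip() or not self.evidence_span.strip():
            raise ValueError("phrase and evidence_span must be non-empty")
        if self.certainty not in ALLOWED_CERTAINTY:
            raise ValueError(f"unknown certainty: {self.certainty!r}")


def parse_extraction(raw_json: str) -> list[ExtractedClinicalSignal]:
    """Parse the model's JSON reply; any malformed item fails the whole payload."""
    items = json.loads(raw_json)
    if not isinstance(items, list):
        raise ValueError("expected a JSON array of signals")
    return [ExtractedClinicalSignal(**item) for item in items]


# Hand-written stand-in for a model reply (no LLM call is made here).
reply = '[{"phrase": "hypertension", "evidence_span": "Pt has HTN", "certainty": "confirmed"}]'
signals = parse_extraction(reply)
```

Failing the whole payload on one bad item is a deliberately strict choice for this sketch; a production system might instead drop the item and log it.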
In practice, the extraction prompt can include a small set of few-shot edge cases for abbreviations, negation, suspected diagnoses, and historical conditions. I include a compact version of that prompt block in the appendix.

Retrieve Candidate Shortlist

Once we have a clean array of ExtractedClinicalSignal objects, we must map them to the 70,000+ codes in the taxonomy. We do not use the LLM for this mapping. Instead, we pipe the extracted entities into a deterministic, programmatic Candidate Search Engine (The Retriever).
In practice, the retrieval layer can be backed by a local index built from the official ICD-10-CM codebook. Each indexed document should contain structured fields such as:
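One plausible shape for an index record, offered as an assumption rather than a prescribed layout: the class name `CodebookEntry` and every field below are hypothetical, chosen because later stages need a description to search, a parent to walk the hierarchy, and a billable flag to validate against.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CodebookEntry:
    """Hypothetical index document for one ICD-10-CM code."""
    code: str                       # e.g. "I50.21"
    description: str                # official long description
    parent: Optional[str]           # immediate parent in the hierarchy, if any
    billable: bool                  # True only for terminal codes
    synonyms: tuple[str, ...] = ()  # extra lexical aliases for sparse search

    def search_text(self) -> str:
        """Lowercased text handed to the sparse and dense indexers."""
        return " ".join((self.description, *self.synonyms)).lower()


entry = CodebookEntry(
    code="I50.21",
    description="Acute systolic (congestive) heart failure",
    parent="I50.2",
    billable=True,
    synonyms=("acute systolic CHF",),
)
```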
The retriever then uses hybrid retrieval to generate a shortlist.
  • Sparse retrieval such as BM25 handles exact lexical matches: specific anatomy terms, alphanumeric phrases, abbreviations already present in the note, and codebook wording.
  • Dense retrieval handles semantic variation: mapping “SOB” to “dyspnea,” “shortness of breath,” or related clinical language.
For each extracted clinical signal, the retriever queries the hybrid index and returns the top candidates, usually a small top_k such as 3 to 5 codes. Those results are aggregated into the candidate_shortlist used by the constrained verifier in the next stage.
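The article does not say how the sparse and dense rankings are merged. One common option is reciprocal rank fusion (RRF), sketched here as an assumption. The function names, the constant `k=60`, and the hand-written rankings are illustrative; a real system would feed in BM25 and embedding-search results.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], top_k: int = 5, k: int = 60) -> list[str]:
    """Fuse several best-first code rankings with reciprocal rank fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, code in enumerate(ranking, start=1):
            scores[code] += 1.0 / (k + rank)
    # Sort by fused score (descending), then by code for a deterministic order.
    ordered = sorted(scores, key=lambda code: (-scores[code], code))
    return ordered[:top_k]


def build_shortlist(per_signal: dict[str, list[list[str]]], top_k: int = 5) -> list[str]:
    """Aggregate per-signal top_k results into one de-duplicated shortlist."""
    shortlist: list[str] = []
    for rankings in per_signal.values():
        for code in rrf_fuse(rankings, top_k=top_k):
            if code not in shortlist:
                shortlist.append(code)
    return shortlist


# Hand-written rankings standing in for real BM25 and embedding results.
sparse = ["R06.02", "R06.00", "R06.09"]
dense = ["R06.00", "R06.02", "R06.03"]
candidate_shortlist = build_shortlist({"SOB": [sparse, dense]}, top_k=3)
```

RRF only needs ranks, not comparable scores, which sidesteps the problem that BM25 scores and cosine similarities live on different scales.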

Run the Constrained Verifier

An auditable clinical system must prove its work. Before any code is suggested to a human operator, the system should generate an intermediate evidence packet. This data structure binds every candidate code to a verbatim text anchor found within the medical record.
Using Pydantic, we can model this constraint-driven schema so that the data structure stays consistent, predictable, and easy for downstream validation tools to check.
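A sketch of what the packet could look like, again with stdlib dataclasses standing in for Pydantic so it runs without dependencies. `EvidencePacket`, `supporting_evidence_span`, `needs_review`, and the negation statuses come from this article; `CodeVerification`, `note_id`, `accepted_codes`, and the `match` / `mismatch` / `affirmed` labels are assumptions.

```python
from dataclasses import dataclass

DECISIONS = {"match", "mismatch", "needs_review"}
NEGATION_STATUSES = {"affirmed", "negated", "uncertain"}


@dataclass(frozen=True)
class CodeVerification:
    """One candidate code bound to the text offered as its evidence."""
    code: str
    decision: str                  # one of DECISIONS
    negation_status: str           # one of NEGATION_STATUSES
    supporting_evidence_span: str  # verbatim anchor; empty means "send to review"

    def __post_init__(self) -> None:
        if self.decision not in DECISIONS:
            raise ValueError(f"invalid decision: {self.decision!r}")
        if self.negation_status not in NEGATION_STATUSES:
            raise ValueError(f"invalid negation_status: {self.negation_status!r}")


@dataclass(frozen=True)
class EvidencePacket:
    """Intermediate audit artifact for one note; not the final billing output."""
    note_id: str
    verifications: tuple[CodeVerification, ...]

    def accepted_codes(self) -> list[str]:
        """Codes matched with affirmed, non-empty evidence."""
        return [
            v.code for v in self.verifications
            if v.decision == "match"
            and v.negation_status == "affirmed"
            and v.supporting_evidence_span.strip()
        ]


packet = EvidencePacket(
    note_id="note-001",
    verifications=(
        CodeVerification("I50.21", "match", "affirmed", "acute systolic heart failure"),
        CodeVerification("R07.9", "match", "negated", "chest pain was ruled out"),
    ),
)
```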
With the candidate_shortlist generated deterministically by the retriever, we introduce the LLM to act as a judge. Its task is to evaluate the shortlisted candidates against the extracted clinical signals and to emit the evidence packet as an auditable structured payload.
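The "constrained" part can be enforced in code as well as in the prompt. The sketch below assumes a shortlist shaped as a code-to-description mapping and a JSON reply format; both function names and the prompt wording are hypothetical, and the reply is hand-written rather than produced by a model.

```python
import json


def build_verifier_prompt(evidence_spans: list[str], candidate_shortlist: dict[str, str]) -> str:
    """Render a closed-set prompt: the model may only judge the listed codes."""
    codes = "\n".join(f"- {code}: {desc}" for code, desc in candidate_shortlist.items())
    spans = "\n".join(f"- {span}" for span in evidence_spans)
    return (
        "Judge each candidate code against the evidence. "
        "Use only the codes listed; do not propose new ones.\n"
        f"Evidence:\n{spans}\nCandidates:\n{codes}\n"
        'Reply as JSON: [{"code": ..., "decision": "match|mismatch|needs_review"}]'
    )


def enforce_shortlist(raw_reply: str, candidate_shortlist: dict[str, str]) -> list[dict]:
    """Reject the whole reply if it mentions a code outside the shortlist."""
    decisions = json.loads(raw_reply)
    unknown = [d["code"] for d in decisions if d["code"] not in candidate_shortlist]
    if unknown:
        raise ValueError(f"codes outside shortlist: {unknown}")
    return decisions


shortlist = {
    "I50.21": "Acute systolic (congestive) heart failure",
    "I50.23": "Acute on chronic systolic (congestive) heart failure",
}
prompt = build_verifier_prompt(["acute systolic heart failure"], shortlist)
ok = enforce_shortlist('[{"code": "I50.21", "decision": "match"}]', shortlist)
```

The prompt instruction alone is not a guarantee, which is why `enforce_shortlist` checks the reply after the fact.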

Validate Against the Official Taxonomy

An LLM’s output can be syntactically correct according to the JSON schema, but clinically invalid according to healthcare regulations. Therefore, the payload must pass through a programmatic Taxonomy Validator—a deterministic software layer running checks against an official CMS/CDC tabular database.
This programmatic validator executes three non-negotiable checks:
  • Code Existence Check: Does the code actually exist in the current fiscal year's active taxonomy? If the LLM generates a defunct or hallucinated code extension, the system catches it instantly via a simple hash-map lookup.
  • Billability Verification: In medical coding, a code is only "billable" if it is carried to its highest level of specificity. ICD-10-CM codes run from 3 to 7 characters, and most billable codes have at least 4, although a few three-character codes such as I10 (Essential hypertension) are complete on their own. For instance, if the LLM verifies category I50 (Heart failure), the taxonomy check flags it as non-billable because it is a parent category rather than a terminal code.
  • Hierarchical Near-Miss Resolution: If the LLM marks a valid code (like I50.21) as a mismatch, but the text contains verified evidence for the parent category (I50), the system triggers a sibling expansion. The pipeline queries the local taxonomy database for the other billable codes under I50 (e.g., I50.23, Acute on chronic systolic heart failure) and automatically queues them for a targeted verification retry.
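The three checks above can be sketched against a toy slice of the taxonomy. The table layout (code mapped to parent and billable flag) and the function names are assumptions; a real validator would load the official release for the current fiscal year instead of a hard-coded dict.

```python
# Toy slice of the taxonomy: code -> (parent, billable).
TAXONOMY = {
    "I50": (None, False),
    "I50.2": ("I50", False),
    "I50.21": ("I50.2", True),
    "I50.22": ("I50.2", True),
    "I50.23": ("I50.2", True),
}


def check_code(code: str) -> str:
    """Return 'unknown', 'non_billable', or 'ok' for a proposed code."""
    if code not in TAXONOMY:
        return "unknown"  # existence check: a plain dict lookup
    _, billable = TAXONOMY[code]
    return "ok" if billable else "non_billable"


def billable_descendants(category: str) -> list[str]:
    """All billable codes underneath `category` in the hierarchy."""
    found = []
    for code, (parent, billable) in TAXONOMY.items():
        # Walk up the parent chain to see whether `category` is an ancestor.
        while parent is not None and parent != category:
            parent = TAXONOMY[parent][0]
        if parent == category and billable:
            found.append(code)
    return sorted(found)


def retry_candidates(rejected_code: str, category: str) -> list[str]:
    """Sibling expansion: billable codes in the category, minus the rejected one."""
    return [c for c in billable_descendants(category) if c != rejected_code]


statuses = {c: check_code(c) for c in ("I50", "I50.21", "I50.99")}
retry = retry_candidates("I50.21", "I50")
```

Here `I50.99` is simply absent from the toy table and stands in for a hallucinated extension.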

Route Uncertainty to Human Review

The goal of this architectural pattern is not full automation, but a highly defensive, human-in-the-loop workflow. An EvidencePacket is not the final business deliverable of a medical coding system. The AI does not finalize the billing submission; it prepares clean, auditable evidence for professional adjudication.
An automated routing layer monitors the output of both the verifier and the taxonomy validator. Cases are flagged for priority human intervention if they hit explicit risk thresholds:
  • Contradiction Flags: The model selects a code but marks its negation status as negated or uncertain.
  • Missing Evidence: The model sets a verification decision to needs_review or provides an empty supporting_evidence_span.
  • Billability Exceptions: The taxonomy validator catches a valid category root that lacks the required sub-classification digits for financial clearance.
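These routing rules are simple enough to express as one pure function. The sketch assumes a verification record with `decision`, `negation_status`, and `supporting_evidence_span` keys and a taxonomy status string such as `"non_billable"`; the function name and reason labels are hypothetical.

```python
def review_reasons(verification: dict, taxonomy_status: str) -> list[str]:
    """Return the reasons (possibly none) a case needs priority human review."""
    reasons = []
    selected = verification["decision"] == "match"
    # Contradiction: a code was selected although the finding is negated or uncertain.
    if selected and verification["negation_status"] in {"negated", "uncertain"}:
        reasons.append("contradiction")
    # Missing evidence: explicit needs_review or an empty evidence span.
    if (verification["decision"] == "needs_review"
            or not verification["supporting_evidence_span"].strip()):
        reasons.append("missing_evidence")
    # Billability: a real category that is not a terminal code.
    if taxonomy_status == "non_billable":
        reasons.append("billability_exception")
    return reasons


contradiction_case = review_reasons(
    {"decision": "match", "negation_status": "negated",
     "supporting_evidence_span": "chest pain was ruled out"},
    taxonomy_status="ok",
)
clean_case = review_reasons(
    {"decision": "match", "negation_status": "affirmed",
     "supporting_evidence_span": "acute systolic heart failure"},
    taxonomy_status="ok",
)
```

Returning a list of reasons rather than a boolean lets the review interface show the auditor why a case was escalated.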
By rendering the supporting_evidence_span directly alongside the code suggestion in the auditor's user interface, the system changes the human task from an exhausting document hunt to a swift visual confirmation. Operators can accept or reject a code in seconds, keeping human expertise exactly where it is needed most: at the final gate of clinical and financial adjudication.

Conclusion

Moving medical coding out of the "black box" of open-ended text generation requires separating clinical intuition from rigid data validation. By decoupling the process into explicit stages—extracting raw clinical data, running constrained LLM verifiers, executing programmatic taxonomy checks, and routing edge cases to human auditors—we mitigate the intrinsic risks of language models.
Ultimately, building a reliable medical coding system is less about forcing an LLM to memorize 70,000 distinct alphanumeric keys, and more about wrapping the model in a deterministic software architecture that enforces verification every step of the way.
 

Appendix: Prompt Tactics for Clinical Extraction

Clinical narratives are messy. Abbreviations, continuous prose, negation, suspected diagnoses, and historical conditions can break a naive extraction prompt. A practical extraction layer should include a few-shot block that teaches the model how to handle these edge cases before it sees the actual patient note.
1. The rule-out trap
Clinical notes often contain phrases such as “rule out MI,” “suspected PE,” or “possible pneumonia.” These should not be treated as confirmed diagnoses. They should be captured as uncertain or suspected findings so the verifier can reconcile them with labs, imaging, discharge diagnosis, or downstream coding rules.
Prompt constraint:
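The wording below is an illustrative guess at such a constraint, not text from a tested prompt. The constant names, the cue-phrase list, and the few-shot pair are all assumptions, and the output shape mirrors a hypothetical signal record with `phrase`, `evidence_span`, and `certainty` fields.

```python
import json

# Illustrative wording only; tune against your own notes and coding policy.
RULE_OUT_CONSTRAINT = (
    'If a finding is introduced by "rule out", "r/o", "suspected", "possible", '
    'or "probable", set certainty to "suspected". Never output it as "confirmed".'
)

# One hand-written few-shot pair showing the expected handling.
RULE_OUT_EXAMPLE = {
    "note": "Admitted to rule out MI. Troponin negative x2.",
    "output": [
        {"phrase": "myocardial infarction",
         "evidence_span": "rule out MI",
         "certainty": "suspected"}
    ],
}


def render_few_shot(constraint: str, example: dict) -> str:
    """Format one constraint plus its worked example for the prompt."""
    return (
        f"Rule: {constraint}\n"
        f"Note: {example['note']}\n"
        f"Output: {json.dumps(example['output'])}"
    )


rule_out_block = render_few_shot(RULE_OUT_CONSTRAINT, RULE_OUT_EXAMPLE)
```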
2. Overlapping entities
In phrases such as “bilateral lower extremity edema secondary to acute systolic heart failure,” a naive extractor may split the text into isolated fragments and lose the causal relationship.
Prompt constraint:
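One possible phrasing, again as an assumption: ask the model to keep both findings and record the link explicitly. The `caused_by` field, the constraint wording, and the `dangling_links` helper are hypothetical; the expected output is hand-written as a few-shot target for the example phrase above.

```python
CAUSAL_CONSTRAINT = (
    "When one finding is described as secondary to, due to, or caused by another, "
    "extract both findings as separate signals and set caused_by on the dependent one. "
    "Evidence spans may overlap."
)

note = "bilateral lower extremity edema secondary to acute systolic heart failure"

# Hand-written few-shot target for the note above (not model output).
expected = [
    {"phrase": "acute systolic heart failure",
     "evidence_span": "acute systolic heart failure",
     "caused_by": None},
    {"phrase": "bilateral lower extremity edema",
     "evidence_span": "bilateral lower extremity edema secondary to acute systolic heart failure",
     "caused_by": "acute systolic heart failure"},
]


def dangling_links(signals: list[dict]) -> list[str]:
    """Return caused_by values that do not point at any extracted phrase."""
    phrases = {s["phrase"] for s in signals}
    return [s["caused_by"] for s in signals
            if s["caused_by"] and s["caused_by"] not in phrases]


problems = dangling_links(expected)
```

Checking for dangling links gives the runtime a cheap way to reject replies where the model recorded a cause it never extracted.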
3. The exact span guardrail
The model must not rewrite evidence spans. If the note says “Pt has HTN,” the output span must not become “Patient has hypertension.”
Programmatic check:
If a span is not found verbatim in the source note, reject the extraction payload before it reaches retrieval or code verification.
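A minimal sketch of that check using plain substring search. The function name and the choice to return character offsets are assumptions; the offsets are there so a review interface could highlight the evidence in place.

```python
def validate_spans(note: str, spans: list[str]) -> list[tuple[int, int]]:
    """Return (start, end) offsets for each span; raise if any span is not verbatim."""
    offsets = []
    for span in spans:
        start = note.find(span) if span else -1
        if start == -1:
            # Reject the whole payload: one rewritten span taints the audit trail.
            raise ValueError(f"span not found verbatim in note: {span!r}")
        offsets.append((start, start + len(span)))
    return offsets


source_note = "Pt has HTN and T2DM."
span_offsets = validate_spans(source_note, ["Pt has HTN"])
```

The match is case- and whitespace-sensitive by design. If notes are normalized before extraction, the same normalized text has to be used here, otherwise valid spans will be rejected.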