The Headline Is Not the Hard Part

Several recent studies confirm that AI-enabled computer-aided detection software can match human radiologists on TB detection in population-based screening. Sensitivity, specificity, AUC — the numbers look good. Health systems in Thailand and across Southeast Asia are piloting these tools. The clinical AI community is excited.

But if you are a data engineer or healthcare IT architect, the papers are not what should excite you. What should excite you — and terrify you — is the infrastructure question those papers never answer: how do you build the data pipeline that keeps this model accurate at scale, across heterogeneous hospital systems, over time?

That is where population-scale clinical AI actually lives or dies. Not in the model architecture. In the pipeline.

DICOM Is Not a Data Format. It Is a Negotiation.

Every radiology AI system starts with DICOM. And DICOM is one of the most hostile data formats in healthcare IT. The spec is enormous, vendor implementations vary wildly, and the metadata that matters — acquisition parameters, device manufacturer, patient demographics, prior study references — is often missing, malformed, or buried in private tags.

Run a TB detection model trained on chest X-rays from a Canon digital radiography system against images from a Siemens unit in a resource-constrained Thai district hospital, and you will see what distribution shift looks like in production. The model does not crash. It just quietly degrades. Sensitivity drops. False negatives creep up. Nobody notices until outcomes data comes back months later — if it ever does.

A mature clinical AI pipeline needs DICOM normalization as a first-class concern. That means standardizing window level and width, handling JPEG2000 vs uncompressed transfers, stripping PHI for model inference while preserving it for downstream FHIR write-back, and flagging images that fall outside the model training distribution before inference ever runs. This is table stakes. Most teams skip it.
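A normalization gate along these lines can be sketched in a few dozen lines. This is a minimal illustration, not a real DICOM parser: the field names mirror standard DICOM attributes, but the PHI tag list, supported transfer syntaxes, and in-distribution manufacturer table are all assumptions a real deployment would derive from its own model card and de-identification policy.

```python
# Minimal sketch of a pre-inference DICOM normalization gate.
# Tag lists and the accepted-distribution table are illustrative.

PHI_TAGS = {"PatientName", "PatientBirthDate", "PatientAddress"}

# Transfer syntaxes the inference service can decode (assumed subset).
SUPPORTED_TRANSFER_SYNTAXES = {
    "1.2.840.10008.1.2.1",     # Explicit VR Little Endian (uncompressed)
    "1.2.840.10008.1.2.4.90",  # JPEG 2000 lossless
}

# Manufacturers represented in the model's training data (assumed).
IN_DISTRIBUTION_MANUFACTURERS = {"CANON", "SIEMENS"}


def normalize_for_inference(meta: dict) -> dict:
    """Strip PHI, validate transfer syntax, and flag out-of-distribution
    studies *before* the image ever reaches the model."""
    flags = []

    if meta.get("TransferSyntaxUID") not in SUPPORTED_TRANSFER_SYNTAXES:
        flags.append("unsupported_transfer_syntax")

    mfr = (meta.get("Manufacturer") or "").strip().upper()
    if mfr not in IN_DISTRIBUTION_MANUFACTURERS:
        flags.append("out_of_training_distribution")

    # Window center/width control how pixel values map to intensity;
    # if they are missing, the model sees unpredictable contrast.
    if "WindowCenter" not in meta or "WindowWidth" not in meta:
        flags.append("missing_windowing")

    # PHI is stripped for inference but kept aside for FHIR write-back.
    phi = {k: v for k, v in meta.items() if k in PHI_TAGS}
    deidentified = {k: v for k, v in meta.items() if k not in PHI_TAGS}

    return {"meta": deidentified, "phi": phi,
            "flags": flags, "eligible": not flags}
```

The point of the gate is that an image failing any check is routed to human review rather than silently scored — the quiet degradation described above happens precisely when out-of-distribution studies flow through unflagged.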

Ground Truth Is Expensive and It Expires

Here is the feedback loop problem that does not show up in research papers: the ground truth that trained your TB detection model was generated by expert radiologists with time, tooling, and training. In production, the downstream confirmation — culture results, clinical follow-up, treatment response — arrives weeks later, in a different system, under a different patient identifier, if it arrives at all.

Without a reliable outcome feedback loop, you cannot detect model drift. You cannot retrain. You are flying blind.

The right architecture solves this with a FHIR-native event bus. When the AI generates a finding — probable TB, low confidence, recommend confirmatory test — that finding goes into a FHIR DiagnosticReport resource and fires an event. When lab results come back, when a physician accepts or overrides the AI recommendation, when treatment is initiated — each of those is a FHIR event too. Your data platform subscribes to all of it and assembles the ground truth record incrementally.
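The incremental-assembly idea can be shown with a toy subscriber. The event shapes and payload fields below are simplified placeholders, not real FHIR resources or a real event-bus client — the sketch only demonstrates the pattern of stitching a labeled record together as pieces arrive out of order.

```python
# Sketch of incremental ground-truth assembly from FHIR-derived events.
# Event kinds and payload fields are illustrative placeholders.
from collections import defaultdict


class GroundTruthAssembler:
    """Stitches each study's AI finding together with the lab result and
    clinician decision that arrive later, in other systems."""

    REQUIRED = {"ai_finding", "lab_result", "clinician_action"}

    def __init__(self):
        self.records = defaultdict(dict)  # study_id -> partial record
        self.completed = []

    def on_event(self, event: dict):
        rec = self.records[event["study_id"]]
        rec[event["kind"]] = event["payload"]
        if self.REQUIRED <= rec.keys():
            # All pieces present: this study can now serve as a labeled
            # example for drift monitoring and retraining.
            rec["label"] = rec["lab_result"]["tb_confirmed"]
            rec["model_said"] = rec["ai_finding"]["probable_tb"]
            self.completed.append((event["study_id"], rec))
```

In production the hard parts are exactly what this sketch elides: patient identity resolution across systems, late or missing events, and deciding when a record without a lab result should be closed out as unlabeled.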

This is not a research problem. It is a data engineering problem. Build the event-driven feedback pipeline or your model accuracy numbers from the pilot will not survive contact with the real world.

Batch Inference Does Not Work for Population Screening

Population-based TB screening means high-throughput chest X-ray programs — hundreds or thousands of images per day. Batch inference with a nightly ETL job is not the right model. You need near-real-time inference triggered at ingestion, with results available before the patient leaves the facility.

That requires a streaming-first architecture. DICOM images arrive, get normalized and anonymized, hit an inference queue, get scored by the model, and write results back to the EHR as a FHIR DiagnosticReport — all within minutes. The data stack underneath looks like: Kafka or Kinesis for event streaming, a feature store for patient context, a model serving layer with hard latency SLAs, and a FHIR API as the write surface.
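The shape of that path can be sketched as a toy single-process pipeline. The stage boundaries and the scoring stub are assumptions for illustration; in a real deployment each arrow would be a Kafka or Kinesis topic and each stage an independently scaled consumer.

```python
# Toy sketch of the streaming path: ingest -> normalize -> score ->
# FHIR write-back. Stage names and payload shapes are illustrative.
import queue


def run_pipeline(studies, score_fn, writeback):
    inference_q = queue.Queue()

    # Ingestion stage: normalize/anonymize, then enqueue for scoring.
    for study in studies:
        study = dict(study, anonymized=True)  # stand-in for real de-ID
        inference_q.put(study)

    # Inference stage: drain the queue, score each study, and write the
    # result back as a DiagnosticReport-shaped payload.
    while not inference_q.empty():
        study = inference_q.get()
        writeback({"resourceType": "DiagnosticReport",
                   "study_id": study["id"],
                   "tb_probability": score_fn(study)})
```

The design point the sketch preserves: inference is triggered by arrival, not by a clock, and the write surface is a FHIR resource rather than an analytics table.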

If your team is still thinking about this as a batch analytics problem, you are building the wrong thing. The clinical workflow does not wait for your ETL window.

The Governance Layer Nobody Wants to Build

Cross-border population screening programs — Thailand, India, sub-Saharan Africa — involve patient data moving across regulatory jurisdictions with different privacy laws, different consent frameworks, and different data residency requirements. The model that performs well in one country may not be legally deployable in another without retraining on local data, executing a new DPIA, and standing up a compliant inference endpoint within national borders.

This is where sovereign AI infrastructure intersects with population health. Your inference pipeline needs to be deployable on-premise or in a local cloud region. Model weights may need to stay within national borders. Audit logs of every inference decision need to be tamper-evident and queryable for regulatory review. If your architecture assumes a single centralized cloud endpoint, you have already designed yourself out of the markets where this technology is most needed.
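Tamper evidence for the audit log is a well-understood construction: hash-chain the entries so any retroactive edit breaks every hash after it. The sketch below shows the idea with SHA-256; entry fields are illustrative, and a production system would also anchor the chain head externally so the whole log cannot be rewritten wholesale.

```python
# Sketch of a tamper-evident, hash-chained audit log for inference
# decisions. Each entry's hash covers the previous entry's hash, so a
# retroactive edit invalidates the rest of the chain.
import hashlib
import json


def _entry_hash(prev_hash: str, entry: dict) -> str:
    payload = json.dumps(entry, sort_keys=True).encode()
    return hashlib.sha256(prev_hash.encode() + payload).hexdigest()


class AuditLog:
    def __init__(self):
        self.entries = []  # list of (entry, hash) pairs

    def append(self, entry: dict):
        prev = self.entries[-1][1] if self.entries else "genesis"
        self.entries.append((entry, _entry_hash(prev, entry)))

    def verify(self) -> bool:
        prev = "genesis"
        for entry, h in self.entries:
            if _entry_hash(prev, entry) != h:
                return False
            prev = h
        return True
```

Queryability for regulators then reduces to ordinary storage on top of a structure whose integrity can be checked in one pass.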

Healthcare IT teams that are not planning for this now will find themselves rebuilding their AI deployment layer in eighteen months when a regulator asks for it.

The Accuracy Number Is a Starting Line

AI matching radiologist accuracy on TB detection in a research setting is genuinely impressive. It is also the easiest part of the problem. The hard part is deploying that model across fifty hospitals with different scanners, different patient populations, and different data quality profiles. Building the feedback loop that keeps it accurate as the disease landscape shifts. Satisfying regulators in multiple jurisdictions simultaneously. Doing all of that without a team of ML engineers on-site at every facility.

The teams that figure this out first will not be the ones with the best model. They will be the ones with the best pipeline. If you are building population health AI and you do not have a DICOM normalization strategy, a FHIR-native feedback loop, and a governance layer designed for multi-jurisdiction deployment — you have a demo, not a product.

The question is not whether AI can detect TB. It can. The question is whether your data stack can keep it detecting TB accurately, across fifty heterogeneous sites, for the next five years without a radiologist babysitting it. That is the real engineering challenge. Build accordingly.