The Dirty Secret of Healthcare Data Engineering
Every healthcare data team I talk to has the same stack: Snowflake for warehousing, dbt for transformation, Airflow or Dagster for orchestration. Clean. Modern. Defensible.
Then you ask them how data gets into Snowflake, and the story falls apart. It's a graveyard of custom Python scripts. One for the Epic FHIR API. Another for the claims flat files from the clearinghouse. A third for the lab vendor's HL7 feed that someone wrote two years ago and nobody wants to touch. A fourth for pulling wearable device data from an IoT platform that changes its API every quarter.
This is the ingestion gap — and in 2026, it's the single biggest source of pipeline fragility in healthcare data platforms.
dlt Closes the Gap
dlt (data load tool) is a Python-native ingestion framework that treats extraction and loading as a first-class engineering concern. It's not a managed SaaS connector service like Fivetran. It's not a heavyweight, self-hosted connector platform like Airbyte. It's a library — pip install it, write a Python function that yields data, and dlt handles schema inference, incremental loading, data typing, and delivery to your warehouse.
For healthcare data engineers, this matters for three reasons:
- Schema evolution without tears. Healthcare data schemas are notoriously unstable. FHIR R4 to R5 migrations, vendor-specific extensions, claims format changes — dlt's automatic schema inference and evolution means your pipeline doesn't break when an upstream source adds a field or changes a type. It adapts.
- Incremental loading by default. Healthcare datasets are large and append-heavy. Claims, encounters, lab results — you don't want full refreshes. dlt supports incremental loading with merge strategies out of the box, using cursor fields you define. No more hand-managing high-water marks in a state table you built at 2 AM.
- Pipeline as code. Your dlt pipeline is a Python script. It lives in your repo, gets version-controlled, goes through code review, and runs in your existing orchestrator. No UI-configured connectors. No black-box sync jobs. Just code you can reason about, test, and debug.
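To make the incremental-merge point concrete, here is a stdlib-only sketch of the high-water-mark bookkeeping that dlt automates. This is illustrative plain Python, not dlt's API — the field names, state dict, and in-memory "warehouse" are all invented for the example:

```python
# Simulated source rows; in a real pipeline these come from an API or file drop.
SOURCE = [
    {"id": 1, "status": "final",   "updated_at": "2026-01-10T08:00:00Z"},
    {"id": 2, "status": "draft",   "updated_at": "2026-01-12T09:30:00Z"},
    {"id": 1, "status": "amended", "updated_at": "2026-01-15T11:00:00Z"},
]

def extract_incremental(state):
    """Yield only rows newer than the stored high-water mark, then advance it."""
    cursor = state.get("last_updated_at", "1970-01-01T00:00:00Z")
    for row in SOURCE:
        if row["updated_at"] > cursor:
            yield row
    state["last_updated_at"] = max((r["updated_at"] for r in SOURCE), default=cursor)

def merge_by_primary_key(target, rows, key="id"):
    """Upsert semantics: the newest row per key wins, like a merge write disposition."""
    for row in rows:
        target[row[key]] = row
    return target

state, warehouse = {}, {}
merge_by_primary_key(warehouse, extract_incremental(state))
# A second run against an unchanged source yields nothing new.
assert list(extract_incremental(state)) == []
```

In dlt you declare the cursor field and merge key on the resource and this entire mechanism — including persisting the state between runs — is handled for you.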
The Canonical Healthcare ELT Stack in 2026
The stack that's winning has converged on three layers: dlt → Snowflake → dbt. Each layer does exactly one thing well:
- dlt extracts from source systems — FHIR APIs, SFTP drops, streaming endpoints, SaaS platforms — and loads raw data into Snowflake with full schema tracking and lineage metadata.
- Snowflake stores, governs, and computes. Cortex provides ML and LLM capabilities directly on the warehouse. Dynamic tables handle lightweight near-real-time transformations.
- dbt transforms raw data into analytics-ready models — patient cohorts, quality measures, cost models, risk scores — with tested, documented, version-controlled SQL.
This isn't theoretical. InterWorks just published a practical walkthrough of dlt-to-Snowflake pipelines, and the pattern maps directly to what production healthcare data teams need. The gap between tutorial and production is smaller than you think because dlt was designed for production from day one — it handles retries, state management, and schema contracts natively.
Healthcare-Specific Patterns That dlt Enables
When you stop fighting ingestion and start engineering it, new patterns emerge:
FHIR resource ingestion with automatic normalization. Write a dlt source that paginates through a FHIR server's Bundle responses and yields individual resources, then let dlt handle flattening the nested JSON into Snowflake's semi-structured columns. Combine with dbt to normalize into your clinical data model downstream.
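The Bundle-walking and flattening pattern can be sketched in plain Python. A dict of canned pages stands in for an authenticated FHIR server here, and the URLs, patient data, and column-naming convention are invented for the example — dlt performs this normalization natively:

```python
# Two simulated FHIR Bundle pages; a real source would GET these URLs over HTTPS.
PAGES = {
    "/Patient?page=1": {
        "resourceType": "Bundle",
        "entry": [{"resource": {"resourceType": "Patient", "id": "p1",
                                "name": [{"family": "Osei", "given": ["Ama"]}]}}],
        "link": [{"relation": "next", "url": "/Patient?page=2"}],
    },
    "/Patient?page=2": {
        "resourceType": "Bundle",
        "entry": [{"resource": {"resourceType": "Patient", "id": "p2",
                                "name": [{"family": "Lind", "given": ["Jo"]}]}}],
        "link": [],
    },
}

def fetch(url):
    return PAGES[url]  # stand-in for an authenticated HTTP GET

def iter_resources(start_url):
    """Walk Bundle pages via the 'next' link, yielding individual resources."""
    url = start_url
    while url:
        bundle = fetch(url)
        for entry in bundle.get("entry", []):
            yield entry["resource"]
        url = next((link["url"] for link in bundle.get("link", [])
                    if link["relation"] == "next"), None)

def flatten(obj, prefix=""):
    """Flatten nested dicts and lists into column-style keys."""
    out = {}
    for k, v in obj.items():
        key = f"{prefix}{k}"
        if isinstance(v, dict):
            out.update(flatten(v, key + "__"))
        elif isinstance(v, list):
            for i, item in enumerate(v):
                if isinstance(item, dict):
                    out.update(flatten(item, f"{key}__{i}__"))
                else:
                    out[f"{key}__{i}"] = item
        else:
            out[key] = v
    return out

rows = [flatten(r) for r in iter_resources("/Patient?page=1")]
```

The yielding-generator shape is the important part: a dlt resource is exactly such a generator, and pagination state, retries, and flattening depth all become configuration rather than hand-written loops.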
Claims file processing with schema contracts. Define explicit schema contracts in dlt for your 837/835 file parsers. When a clearinghouse changes their format — and they will — your pipeline fails loudly at ingestion rather than silently producing garbage in your analytics layer.
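The fail-loudly mechanic looks roughly like this in plain Python. The contract shape and 835 field names below are illustrative, not dlt's contract syntax — in dlt you declare contracts on the source or resource and violations surface at load time:

```python
# Declared contract for parsed 835 remittance rows (field names illustrative).
CONTRACT = {"claim_id": str, "paid_amount": float, "adjudication_date": str}

class SchemaContractViolation(Exception):
    """Raised at ingestion so drift never reaches the analytics layer."""

def enforce_contract(row, contract=CONTRACT):
    """Validate one parsed row against the declared schema, failing loudly."""
    unexpected = set(row) - set(contract)
    if unexpected:
        raise SchemaContractViolation(f"unexpected fields: {sorted(unexpected)}")
    for field, expected_type in contract.items():
        if field not in row:
            raise SchemaContractViolation(f"missing field: {field}")
        if not isinstance(row[field], expected_type):
            raise SchemaContractViolation(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(row[field]).__name__}")
    return row

ok = enforce_contract(
    {"claim_id": "C100", "paid_amount": 125.50, "adjudication_date": "2026-01-15"})
```

A clearinghouse silently switching `paid_amount` from a number to a quoted string now stops the pipeline at ingestion instead of poisoning every downstream cost model.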
Multi-source patient matching pipelines. Ingest from multiple EHR systems, claims sources, and device platforms into separate raw schemas. Use dlt's built-in lineage metadata to track provenance through the entire matching and deduplication workflow in dbt.
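A minimal sketch of provenance-tagged matching, assuming a deterministic shared key (the hashed identifier, source names, and records here are all hypothetical). Real patient matching is usually probabilistic and lives downstream in dbt; the point is that provenance tagged at ingestion survives the whole workflow:

```python
# Records from two raw schemas, tagged with provenance at ingestion time.
ehr_a  = [{"mrn": "A-100", "id_hash": "h1", "name": "Ama Osei"}]
claims = [{"member_id": "M-7", "id_hash": "h1", "name": "A. Osei"},
          {"member_id": "M-9", "id_hash": "h2", "name": "Jo Lind"}]

def tag(rows, source):
    """Stamp each record with its origin system before loading."""
    return [{**row, "_source": source} for row in rows]

def match_patients(*sources, key="id_hash"):
    """Group records from all sources by a deterministic match key."""
    golden = {}
    for row in (r for rows in sources for r in rows):
        golden.setdefault(row[key], []).append(row)
    return golden

matched = match_patients(tag(ehr_a, "ehr_a"), tag(claims, "claims"))
```

Because every record carries `_source`, an auditor can trace any golden-record attribute back to the system that supplied it — which is precisely what a CMS audit will ask for.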
PHI-aware pipeline design. dlt pipelines run in your environment — your VPC, your containers, your security boundary. Unlike managed connector platforms, PHI never transits a third-party system. For HIPAA-regulated workloads, this isn't a nice-to-have. It's a requirement that eliminates an entire category of BAA negotiation and compliance risk.
The Real Cost of Not Adopting This
Every custom ingestion script is a maintenance liability. Every hand-rolled connector is a point of failure that only one person understands. Every pipeline that breaks at 3 AM because a source schema changed is an incident that didn't need to happen.
Healthcare data teams are under pressure to deliver AI-ready datasets, real-time analytics, and regulatory reporting — all simultaneously. You cannot afford to spend engineering cycles on problems that have been solved. dlt solves ingestion. dbt solves transformation. Snowflake solves storage and compute.
The teams that assemble this stack now will be the ones shipping production healthcare AI this year. The ones still maintaining artisanal ingestion scripts will be explaining to their CTO why the data pipeline broke during a CMS audit.
Pick your side.