Every health system with an Epic or Oracle Health instance now has a FHIR R4 API. It's mandated. The 21st Century Cures Act made sure of that. What nobody mandated — and what almost nobody has figured out — is how to actually analyze that data.
There's a yawning gap between having a FHIR-compliant API and having FHIR data you can query in a warehouse. I've watched teams spend months building elaborate ingestion pipelines that produce data nobody can use, because they treated FHIR as just another JSON API instead of understanding what it actually is: a deeply nested, polymorphic, extension-laden document format that was designed for interoperability, not analytics.
Here's the architecture that actually works — and the pitfalls that will waste six months of your team's time if you don't know about them upfront.
Why FHIR Is Hard to Work With in a Warehouse
FHIR's design is brilliant for what it was built to do: exchange clinical data between systems that don't share the same data model. For that purpose, its flexibility is a feature. For analytics, that same flexibility is a nightmare.
Consider a Patient resource. The name isn't a string — it's an array of HumanName objects, each with a use field (official, nickname, maiden, etc.), a family string, and a given array. To get the patient's legal name you need to filter the array for use = 'official', then concatenate given[0] and family. That's a four-step operation to get what SQL developers expect to be a single column.
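Those four steps look like this in Snowflake SQL. The `name` array below is invented and inlined with `PARSE_JSON` so the query is self-contained:

```sql
-- Invented Patient.name array, inlined so the query runs standalone.
WITH patient AS (
  SELECT PARSE_JSON('{
    "name": [
      {"use": "nickname", "given": ["Lin"]},
      {"use": "official", "family": "Nguyen", "given": ["Linh", "T"]}
    ]
  }') AS src
)
SELECT
  n.value:given[0]::string || ' ' || n.value:family::string AS legal_name
FROM patient,
LATERAL FLATTEN(input => src:name) n     -- one row per name object
WHERE n.value:use::string = 'official';  -- keep only the legal name
```

This returns `Linh Nguyen`. One column's worth of information, four structural operations to get it.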
Now multiply that complexity across every resource type, and layer on:
- Polymorphic choice types — the `value[x]` pattern means an Observation's value might be a `valueQuantity`, `valueString`, `valueBoolean`, `valueCodeableConcept`, or half a dozen other types, and only one will be populated per row
- Extensions — vendor-specific and IG-specific fields that live in a nested `extension` array, keyed by URL, with their own polymorphic value types
- CodeableConcept everywhere — diagnoses, medications, observations — all represented as arrays of codings across multiple vocabulary systems (LOINC, SNOMED, RxNorm, ICD-10, local codes), often with no guarantee of which system will be present
- Reference resolution — a reference to `Patient/12345` might be a relative reference, an absolute URL, or a `urn:uuid`, depending on the source system
None of this is a bug in FHIR. It's the right design for a universal interoperability standard. But if you throw raw FHIR JSON into a warehouse and assume your analysts can query it, you're setting everyone up to fail.
Step One: The Bulk Export Pattern
The right entry point for analytics isn't the FHIR read API — it's the FHIR Bulk Data Access specification, specifically the $export operation. Instead of paginating through individual resources, bulk export gives you all resources of a given type as NDJSON files (newline-delimited JSON, one resource per line). For a mid-size health system, that might be millions of Patient and Observation records delivered as a set of compressed files.
The pattern looks like this:
- Kick off a group-level export: `GET [base]/Group/[id]/$export` with `Accept: application/fhir+json` and `Prefer: respond-async` headers (the output format defaults to NDJSON)
- Poll the status URL until the job completes (these can run for hours on large datasets)
- Download the output files — one per resource type — to S3 or Azure Blob
- `COPY INTO` a Snowflake staging table with a single `VARIANT` column
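The load step can be sketched as follows. The stage name `@fhir_stage` and table name are placeholders:

```sql
-- One VARIANT column absorbs the raw NDJSON; metadata columns track lineage.
CREATE TABLE IF NOT EXISTS raw_fhir_patient (
  src          VARIANT,
  _loaded_at   TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(),
  _source_file STRING
);

-- Snowflake's JSON file format reads NDJSON as one row per line.
COPY INTO raw_fhir_patient (src, _source_file)
FROM (
  SELECT $1, METADATA$FILENAME
  FROM @fhir_stage/patient/
)
FILE_FORMAT = (TYPE = 'JSON');
```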
That last step is where Snowflake earns its keep. A staging table with one VARIANT column absorbs raw FHIR JSON with no schema definition, no column mapping, no upfront decisions about what to extract. You load everything first, transform later. This is critical because your understanding of what you need from the data will change, and you want to be able to re-derive any field without going back to the source.
For incremental loads, the $export operation accepts a _since parameter (an ISO 8601 timestamp) that limits the export to resources updated after that point. Set up a daily job, track your high-water mark, and you have a simple but reliable incremental pattern that doesn't require change data capture or database-level triggers on the source system.
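A minimal sketch of the high-water-mark tracking, using the `_loaded_at` metadata column on the raw tables (table name hypothetical):

```sql
-- Compute the timestamp to pass as _since on the next $export call.
SELECT TO_VARCHAR(
         MAX(_loaded_at),
         'YYYY-MM-DD"T"HH24:MI:SS"Z"'   -- ISO 8601, as $export expects
       ) AS since_param
FROM raw_fhir_observation;
-- Next kick-off: GET [base]/Group/[id]/$export?_since=<since_param>
```

If your source populates `meta.lastUpdated` reliably, the max of `src:meta:lastUpdated` is a tighter watermark than load time, at the cost of trusting the source's clocks.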
Step Two: The dbt Transformation Layer
With raw FHIR JSON sitting in Snowflake VARIANT columns, the transformation work begins. This is where dbt shines — and where most teams make their first big architectural mistake.
The mistake is trying to flatten everything at once. Don't write a single staging model that extracts every field you might ever need from a Patient resource. Instead, build three layers:
Staging: Extract the obvious fields
In your staging models, pull out the scalar fields that are always present and always in the same location. For Patient, that's `id`, `gender`, `birthDate`, and the raw resource for downstream use. Use Snowflake's colon notation for simple paths: `src:id::string`, `src:gender::string`, `src:birthDate::string`. Keep the full VARIANT column in staging — you'll reference it repeatedly in downstream models.
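A staging model in this style might look like the following; the model and source names are illustrative:

```sql
-- models/staging/stg_fhir__patient.sql
SELECT
  src:id::string        AS patient_id,
  src:gender::string    AS gender,
  src:birthDate::string AS birth_date_raw,  -- keep as string: FHIR allows partial dates
  src                   AS resource,        -- full VARIANT, for downstream models
  _loaded_at
FROM {{ source('fhir_raw', 'patient') }}
```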
Intermediate: Handle the complexity
This is where you tackle arrays, choice types, and extensions. Snowflake's LATERAL FLATTEN is your primary tool. To extract the official name from a Patient:
`LATERAL FLATTEN(input => src:name) AS n` gives you one row per name object, then `WHERE n.value:use::string = 'official'` filters to the legal name. For given names, you need a second flatten: `LATERAL FLATTEN(input => n.value:given) AS g` to handle the array of first/middle names.
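Put together as a dbt intermediate model, the name extraction might look like this. Model names are hypothetical, and `ARRAY_TO_STRING` collapses the second flatten into one column:

```sql
-- models/intermediate/int_patient__official_name.sql
SELECT
  p.patient_id,
  n.value:family::string                     AS family_name,
  ARRAY_TO_STRING(n.value:given::array, ' ') AS given_names
FROM {{ ref('stg_fhir__patient') }} p,
LATERAL FLATTEN(input => p.resource:name) n
WHERE n.value:use::string = 'official'
-- Guard against multiple 'official' names: keep the first by array position.
QUALIFY ROW_NUMBER() OVER (PARTITION BY p.patient_id ORDER BY n.index) = 1
```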
For CodeableConcepts, the pattern is similar — flatten the coding array and filter for your preferred vocabulary system. For Observations, you'd prefer LOINC first, then SNOMED, then fall back to whatever local code exists. Encoding that preference order in a single intermediate model means every downstream consumer gets consistent codes without re-implementing the logic.
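One way to encode that preference order, assuming a staging model that preserves the raw resource (model names are illustrative):

```sql
-- Pick exactly one coding per Observation: LOINC first, then SNOMED,
-- then whatever local code remains.
SELECT
  o.observation_id,
  c.value:system::string  AS code_system,
  c.value:code::string    AS code,
  c.value:display::string AS display
FROM {{ ref('stg_fhir__observation') }} o,
LATERAL FLATTEN(input => o.resource:code:coding) c
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY o.observation_id
  ORDER BY
    CASE c.value:system::string
      WHEN 'http://loinc.org'       THEN 1
      WHEN 'http://snomed.info/sct' THEN 2
      ELSE 3
    END,
    c.index
) = 1
```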
Marts: Analytics-ready tables
Mart models join and denormalize into wide tables that analysts actually want to query. A dim_patient table with one row per patient, a fct_encounter with one row per encounter, a fct_observation_lab filtered to laboratory results with values extracted into typed columns. By the time data reaches the mart layer, there's no JSON left — just regular SQL columns with real data types.
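A mart model then becomes a plain join over the layers beneath it. This sketch assumes a hypothetical intermediate model `int_patient__official_name` with one official name per patient:

```sql
-- models/marts/dim_patient.sql
SELECT
  p.patient_id,
  n.given_names,
  n.family_name,
  TRY_TO_DATE(p.birth_date_raw) AS birth_date,  -- pad partial dates upstream first
  p.gender
FROM {{ ref('stg_fhir__patient') }} p
LEFT JOIN {{ ref('int_patient__official_name') }} n
  USING (patient_id)
```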
Resource-by-Resource: What to Actually Extract
The theory is clean. The practice is full of vendor-specific wrinkles. Here's what I've learned handling these resources across multiple health systems:
Patient
The MRN lives in the identifier array. Filter for the identifier where type.coding[].code = 'MR' or where the system matches your health system's known OID. Don't assume position — the MRN isn't always first. Race and ethnicity are almost always in extensions, typically the OMB race extension at http://hl7.org/fhir/us/core/StructureDefinition/us-core-race — you'll need to flatten the extension array and then flatten again to get the individual race category codings inside it.
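An MRN extraction along those lines, with hypothetical model names; the race extension follows the same flatten-then-flatten-again shape:

```sql
-- One MRN per patient: filter identifiers by the 'MR' type coding,
-- never by array position.
SELECT
  p.patient_id,
  i.value:value::string AS mrn
FROM {{ ref('stg_fhir__patient') }} p,
LATERAL FLATTEN(input => p.resource:identifier) i,
LATERAL FLATTEN(input => i.value:type:coding) tc
WHERE tc.value:code::string = 'MR'
QUALIFY ROW_NUMBER() OVER (PARTITION BY p.patient_id ORDER BY i.index) = 1
```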
Observation
The value[x] polymorphism is the main challenge. Write a single expression that coalesces across all possible value types: COALESCE(src:valueQuantity:value::float, src:valueString::string, ...). Separately extract the unit from valueQuantity.unit or valueQuantity.code (they're not always the same — one is display text, one is UCUM). For vital signs and labs, effectiveDateTime is usually present; for document-derived observations it might be effectivePeriod instead, so always check both.
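A sketch of the value extraction, with a hypothetical raw table name. Keeping numeric and text values in separate typed columns avoids mixed-type COALESCE surprises:

```sql
SELECT
  src:id::string                 AS observation_id,
  src:valueQuantity:value::float AS value_numeric,
  src:valueQuantity:unit::string AS unit_display,   -- human-readable text
  src:valueQuantity:code::string AS unit_ucum,      -- UCUM code; not always the same
  COALESCE(
    src:valueString::string,
    src:valueCodeableConcept:text::string,
    src:valueBoolean::string
  )                              AS value_text,
  COALESCE(
    src:effectiveDateTime::string,
    src:effectivePeriod:start::string
  )                              AS effective_raw   -- still a string; see date pitfalls
FROM raw_fhir_observation
```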
Encounter
The encounter class — inpatient, outpatient, emergency — uses the class field with ActCode values (IMP, AMB, EMER). The hospitalization details you actually want (admit source, discharge disposition) live in hospitalization as CodeableConcepts, and the codes you get back will vary by EHR vendor. Build a reference table mapping vendor-specific codes to your standardized categories early, or you'll end up with 40 variations of "routine discharge."
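The class normalization is a small CASE expression over the ActCode values above; the discharge-disposition mapping is better kept in a dbt seed you can review with clinicians. Table name hypothetical:

```sql
SELECT
  src:id::string         AS encounter_id,
  src:class:code::string AS class_code,   -- Encounter.class is a single Coding in R4
  CASE src:class:code::string
    WHEN 'IMP'  THEN 'inpatient'
    WHEN 'AMB'  THEN 'outpatient'
    WHEN 'EMER' THEN 'emergency'
    ELSE 'other'
  END                    AS encounter_class,
  src:hospitalization:dischargeDisposition:coding[0]:code::string
                         AS discharge_disposition_code  -- join to a seed mapping table
FROM raw_fhir_encounter
```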
The Pitfalls Nobody Warns You About
The FHIR spec tells you what's possible. Your EHR vendor tells you what's actual. These are not the same document.
A few specific traps that have cost teams real time:
- Date parsing. FHIR dates are strings, not timestamps, and the spec allows partial dates: `2026`, `2026-02`, and `2026-02-23` are all valid `date` values. `TRY_TO_DATE` in Snowflake will silently return NULL for the partial formats. Pad them first: `CASE WHEN LENGTH(birthDate) = 4 THEN birthDate || '-01-01' ...` before casting.
- Reference formats. Patient references might come as `Patient/12345` (relative), `https://fhir.epic.com/api/FHIR/R4/Patient/12345` (absolute), or `urn:uuid:a7d43b2c-...` (logical). Write a normalization function that strips everything except the resource ID, and apply it consistently.
- Empty CodeableConcepts. Some vendors populate `text` but leave the `coding` array empty. If your pipeline requires a LOINC code and the EHR didn't map it, you'll get silent nulls. Build completeness checks into your dbt tests from day one.
- The Observation status trap. FHIR Observations have a `status` field: `registered`, `preliminary`, `final`, `amended`, `corrected`, `cancelled`, `entered-in-error`. Most analytics use cases only want `final` and `amended`, but if you don't filter explicitly you'll be including preliminary values and retracted results in your cohorts.
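Sketches for three of these traps, with hypothetical table names:

```sql
-- 1) Pad partial FHIR dates before casting, so TRY_TO_DATE doesn't null them.
SELECT TRY_TO_DATE(
         CASE LENGTH(src:birthDate::string)
           WHEN 4 THEN src:birthDate::string || '-01-01'  -- '2026'    -> '2026-01-01'
           WHEN 7 THEN src:birthDate::string || '-01'     -- '2026-02' -> '2026-02-01'
           ELSE src:birthDate::string
         END
       ) AS birth_date
FROM raw_fhir_patient;

-- 2) Normalize all three reference formats down to a bare resource id:
--    everything after the last '/' or ':' is the id in all three cases.
SELECT REGEXP_SUBSTR(src:subject:reference::string, '[^/:]+$') AS patient_id
FROM raw_fhir_observation;

-- 3) Filter Observation status explicitly rather than by omission.
SELECT *
FROM raw_fhir_observation
WHERE src:status::string IN ('final', 'amended');
```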
The Complete Architecture
Putting it all together, the architecture that works looks like this:
- Source: EHR FHIR R4 API (Epic, Oracle Health, or your FHIR server of choice)
- Extract: Daily `$export` job with `_since` for incremental → NDJSON files in S3
- Load: Snowflake `COPY INTO` → `raw.fhir.patient`, `raw.fhir.observation`, etc., each with a single `VARIANT` column plus `_loaded_at` and `_source_file` metadata columns
- Stage: dbt staging models extract scalar fields, preserve the raw VARIANT for downstream use
- Intermediate: dbt intermediate models handle arrays, CodeableConcepts, extensions, reference normalization
- Mart: dbt mart models produce analytics-ready wide tables — `dim_patient`, `fct_encounter`, `fct_observation_vitals`, `fct_observation_lab`, `fct_condition`, `fct_medication_request`
- Serve: BI tools (Tableau, Looker, Power BI) or ad-hoc SQL against mart tables
The raw layer is immutable — you never modify it. Every transformation lives in dbt, is version-controlled, is tested, and can be re-run from scratch if you need to re-derive a field you didn't extract the first time. The VARIANT column is your insurance policy.
One More Thing: Schema Evolution
One of the underappreciated advantages of this architecture is how gracefully it handles change. When your EHR vendor upgrades to a new version of the US Core Implementation Guide and starts populating fields that didn't exist before, your raw layer absorbs it automatically — VARIANT doesn't care. Your staging and intermediate models just keep running, ignoring the new fields until you're ready to extract them. Contrast that with a column-mapped staging table where an unexpected new field in the source JSON breaks your load entirely.
This also means you can ship incrementally. Start with Patient, Encounter, and Observation. Get those marts to a state your clinical analytics team can use. Then add Condition, MedicationRequest, and DiagnosticReport in the next sprint. The raw layer is already there; you're just writing new dbt models.
Where AI Changes This
The architecture I've described is well-understood by people who've done it before. The problem is that most healthcare data teams haven't done it before, which means they rediscover every one of these pitfalls from scratch. That's changing.
We're starting to use AI agents to automate parts of this pipeline that were previously manual and error-prone. Specifically: given a FHIR StructureDefinition or Implementation Guide profile, agents can generate the dbt staging and intermediate models automatically — correctly handling the polymorphic fields, the array flattening, the CodeableConcept normalization. It's not perfect yet, but for the 80% of resources that follow predictable patterns, it's faster than writing it by hand and less error-prone than having a developer who's new to FHIR write it from memory.
We're also using agents to monitor data quality at the FHIR layer — detecting when a vendor upgrade changes the structure of an extension, when a code system that was present in staging disappears, when completeness rates for a particular field drop unexpectedly. These are the kinds of quiet failures that previously went unnoticed for weeks.
The architecture is the same. The humans doing the work are fewer, and the ones who remain are working on things that actually require judgment.
Ready to Actually Use Your FHIR Data?
We design and build FHIR analytics platforms — from bulk export architecture to analytics-ready Snowflake marts — for healthcare organizations that need more than compliance checkboxes.
Let's Talk