Every health system with an Epic or Oracle Health instance now has a FHIR R4 API. It's mandated. The 21st Century Cures Act made sure of that. What nobody mandated — and what almost nobody has figured out — is how to actually analyze that data.
There's a yawning gap between having a FHIR-compliant API and having FHIR data you can query in a warehouse. I've watched teams spend months building elaborate ingestion pipelines that produce data nobody can use, because they treated FHIR as just another JSON API instead of understanding what it actually is: a deeply nested, polymorphic, extension-laden document format that was designed for interoperability, not analytics.
Here's the architecture that actually works — and the pitfalls that will waste six months of your team's time if you don't know about them upfront.
Why FHIR Is Hard to Work With in a Warehouse
FHIR's design is brilliant for what it was built to do: exchange clinical data between systems that don't share the same data model. For that purpose, its flexibility is a feature. For analytics, that same flexibility is a nightmare.
Consider a Patient resource. The name isn't a string — it's an array of HumanName objects, each with a use field (official, nickname, maiden, etc.), a family string, and a given array. To get the patient's legal name you need to filter the array for use = 'official', then concatenate given[0] and family. That's a four-step operation to get what SQL developers expect to be a single column.
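Those four steps look like this in Snowflake SQL. The `name` array below is invented and inlined with `PARSE_JSON` so the query is self-contained:

```sql
-- Invented Patient.name array, inlined so the query runs standalone.
WITH patient AS (
  SELECT PARSE_JSON('{
    "name": [
      {"use": "nickname", "given": ["Lin"]},
      {"use": "official", "family": "Nguyen", "given": ["Linh", "T"]}
    ]
  }') AS src
)
SELECT
  n.value:given[0]::string || ' ' || n.value:family::string AS legal_name
FROM patient,
LATERAL FLATTEN(input => src:name) n     -- one row per name object
WHERE n.value:use::string = 'official';  -- keep only the legal name
```

This returns `Linh Nguyen`. One column's worth of information, four structural operations to get it.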
Now multiply that complexity across every resource type, and layer on:
- Polymorphic choice types — the `value[x]` pattern means an Observation's value might be a `valueQuantity`, `valueString`, `valueBoolean`, `valueCodeableConcept`, or half a dozen other types, and only one will be populated per row
- Extensions — vendor-specific and IG-specific fields that live in a nested `extension` array, keyed by URL, with their own polymorphic value types
- CodeableConcept everywhere — diagnoses, medications, observations — all represented as arrays of codings across multiple vocabulary systems (LOINC, SNOMED, RxNorm, ICD-10, local codes), often with no guarantee of which system will be present
- Reference resolution — a reference to `Patient/12345` might be a relative reference, an absolute URL, or a `urn:uuid`, depending on the source system
None of this is a bug in FHIR. It's the right design for a universal interoperability standard. But if you throw raw FHIR JSON into a warehouse and assume your analysts can query it, you're setting everyone up to fail.
Step One: The Bulk Export Pattern
The right entry point for analytics isn't the FHIR read API — it's the FHIR Bulk Data Access specification, specifically the $export operation. Instead of paginating through individual resources, bulk export gives you all resources of a given type as NDJSON files (newline-delimited JSON, one resource per line). For a mid-size health system, that might be millions of Patient and Observation records delivered as a set of compressed files.
The pattern looks like this:
- Kick off a group-level export: `GET [base]/Group/[id]/$export` with `Accept: application/fhir+json` and `Prefer: respond-async` headers (the output format defaults to NDJSON)
- Poll the status URL until the job completes (these can run for hours on large datasets)
- Download the output files — one per resource type — to S3 or Azure Blob
- `COPY INTO` a Snowflake staging table with a single `VARIANT` column
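The load step can be sketched as follows. The stage name `@fhir_stage` and table name are placeholders:

```sql
-- One VARIANT column absorbs the raw NDJSON; metadata columns track lineage.
CREATE TABLE IF NOT EXISTS raw_fhir_patient (
  src          VARIANT,
  _loaded_at   TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(),
  _source_file STRING
);

-- Snowflake's JSON file format reads NDJSON as one row per line.
COPY INTO raw_fhir_patient (src, _source_file)
FROM (
  SELECT $1, METADATA$FILENAME
  FROM @fhir_stage/patient/
)
FILE_FORMAT = (TYPE = 'JSON');
```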
That last step is where Snowflake earns its keep. A staging table with one VARIANT column absorbs raw FHIR JSON with no schema definition, no column mapping, no upfront decisions about what to extract. You load everything first, transform later. This is critical because your understanding of what you need from the data will change, and you want to be able to re-derive any field without going back to the source.
For incremental loads, the $export operation accepts a _since parameter (an ISO 8601 timestamp) that limits the export to resources updated after that point. Set up a daily job, track your high-water mark, and you have a simple but reliable incremental pattern that doesn't require change data capture or database-level triggers on the source system.
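A minimal sketch of the high-water-mark tracking, using the `_loaded_at` metadata column on the raw tables (table name hypothetical):

```sql
-- Compute the timestamp to pass as _since on the next $export call.
SELECT TO_VARCHAR(
         MAX(_loaded_at),
         'YYYY-MM-DD"T"HH24:MI:SS"Z"'   -- ISO 8601, as $export expects
       ) AS since_param
FROM raw_fhir_observation;
-- Next kick-off: GET [base]/Group/[id]/$export?_since=<since_param>
```

If your source populates `meta.lastUpdated` reliably, the max of `src:meta:lastUpdated` is a tighter watermark than load time, at the cost of trusting the source's clocks.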
Step Two: The dbt Transformation Layer
With raw FHIR JSON sitting in Snowflake VARIANT columns, the transformation work begins. This is where dbt shines — and where most teams make their first big architectural mistake.
The mistake is trying to flatten everything at once. Don't write a single staging model that extracts every field you might ever need from a Patient resource. Instead, build three layers:
Staging: Extract the obvious fields
In your staging models, pull out the scalar fields that are always present and always in the same location. For Patient, that's `id`, `gender`, `birthDate`, and the raw resource for downstream use. Use Snowflake's colon notation for simple paths: `src:id::string`, `src:gender::string`, `src:birthDate::string`. Keep the full VARIANT column in staging — you'll reference it repeatedly in downstream models.
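A staging model in this style might look like the following; the model and source names are illustrative:

```sql
-- models/staging/stg_fhir__patient.sql
SELECT
  src:id::string        AS patient_id,
  src:gender::string    AS gender,
  src:birthDate::string AS birth_date_raw,  -- keep as string: FHIR allows partial dates
  src                   AS resource,        -- full VARIANT, for downstream models
  _loaded_at
FROM {{ source('fhir_raw', 'patient') }}
```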
Intermediate: Handle the complexity
This is where you tackle arrays, choice types, and extensions. Snowflake's LATERAL FLATTEN is your primary tool. To extract the official name from a Patient:
`LATERAL FLATTEN(input => src:name) AS n` gives you one row per name object, then `WHERE n.value:use::string = 'official'` filters to the legal name. For given names, you need a second flatten: `LATERAL FLATTEN(input => n.value:given) AS g` to handle the array of first/middle names.
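Put together as a dbt intermediate model, the name extraction might look like this. Model names are hypothetical, and `ARRAY_TO_STRING` collapses the second flatten into one column:

```sql
-- models/intermediate/int_patient__official_name.sql
SELECT
  p.patient_id,
  n.value:family::string                     AS family_name,
  ARRAY_TO_STRING(n.value:given::array, ' ') AS given_names
FROM {{ ref('stg_fhir__patient') }} p,
LATERAL FLATTEN(input => p.resource:name) n
WHERE n.value:use::string = 'official'
-- Guard against multiple 'official' names: keep the first by array position.
QUALIFY ROW_NUMBER() OVER (PARTITION BY p.patient_id ORDER BY n.index) = 1
```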
For CodeableConcepts, the pattern is similar — flatten the coding array and filter for your preferred vocabulary system. For Observations, you'd prefer LOINC first, then SNOMED, then fall back to whatever local code exists. Encoding that preference order in a single intermediate model means every downstream consumer gets consistent codes without re-implementing the logic.
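One way to encode that preference order, assuming a staging model that preserves the raw resource (model names are illustrative):

```sql
-- Pick exactly one coding per Observation: LOINC first, then SNOMED,
-- then whatever local code remains.
SELECT
  o.observation_id,
  c.value:system::string  AS code_system,
  c.value:code::string    AS code,
  c.value:display::string AS display
FROM {{ ref('stg_fhir__observation') }} o,
LATERAL FLATTEN(input => o.resource:code:coding) c
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY o.observation_id
  ORDER BY
    CASE c.value:system::string
      WHEN 'http://loinc.org'       THEN 1
      WHEN 'http://snomed.info/sct' THEN 2
      ELSE 3
    END,
    c.index
) = 1
```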
Marts: Analytics-ready tables
Mart models join and denormalize into wide tables that analysts actually want to query. A dim_patient table with one row per patient, a fct_encounter with one row per encounter, a fct_observation_lab filtered to laboratory results with values extracted into typed columns. By the time data reaches the mart layer, there's no JSON left — just regular SQL columns with real data types.
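A mart model then becomes a plain join over the layers beneath it. This sketch assumes a hypothetical intermediate model `int_patient__official_name` with one official name per patient:

```sql
-- models/marts/dim_patient.sql
SELECT
  p.patient_id,
  n.given_names,
  n.family_name,
  TRY_TO_DATE(p.birth_date_raw) AS birth_date,  -- pad partial dates upstream first
  p.gender
FROM {{ ref('stg_fhir__patient') }} p
LEFT JOIN {{ ref('int_patient__official_name') }} n
  USING (patient_id)
```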
Resource-by-Resource: What to Actually Extract
The theory is clean. The practice is full of vendor-specific wrinkles. Here's what I've learned handling these resources across multiple health systems:
Patient
The MRN lives in the identifier array. Filter for the identifier where type.coding[].code = 'MR' or where the system matches your health system's known OID. Don't assume position — the MRN isn't always first. Race and ethnicity are almost always in extensions, typically the OMB race extension at http://hl7.org/fhir/us/core/StructureDefinition/us-core-race — you'll need to flatten the extension array and then flatten again to get the individual race category codings inside it.
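An MRN extraction along those lines, with hypothetical model names; the race extension follows the same flatten-then-flatten-again shape:

```sql
-- One MRN per patient: filter identifiers by the 'MR' type coding,
-- never by array position.
SELECT
  p.patient_id,
  i.value:value::string AS mrn
FROM {{ ref('stg_fhir__patient') }} p,
LATERAL FLATTEN(input => p.resource:identifier) i,
LATERAL FLATTEN(input => i.value:type:coding) tc
WHERE tc.value:code::string = 'MR'
QUALIFY ROW_NUMBER() OVER (PARTITION BY p.patient_id ORDER BY i.index) = 1
```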
Observation
The value[x] polymorphism is the main challenge. Write a single expression that coalesces across all possible value types: COALESCE(src:valueQuantity:value::float, src:valueString::string, ...). Separately extract the unit from valueQuantity.unit or valueQuantity.code (they're not always the same — one is display text, one is UCUM). For vital signs and labs, effectiveDateTime is usually present; for document-derived observations it might be effectivePeriod instead, so always check both.
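A sketch of the value extraction, with a hypothetical raw table name. Keeping numeric and text values in separate typed columns avoids mixed-type COALESCE surprises:

```sql
SELECT
  src:id::string                 AS observation_id,
  src:valueQuantity:value::float AS value_numeric,
  src:valueQuantity:unit::string AS unit_display,   -- human-readable text
  src:valueQuantity:code::string AS unit_ucum,      -- UCUM code; not always the same
  COALESCE(
    src:valueString::string,
    src:valueCodeableConcept:text::string,
    src:valueBoolean::string
  )                              AS value_text,
  COALESCE(
    src:effectiveDateTime::string,
    src:effectivePeriod:start::string
  )                              AS effective_raw   -- still a string; see date pitfalls
FROM raw_fhir_observation
```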
Encounter
The encounter class — inpatient, outpatient, emergency — uses the class field with ActCode values (IMP, AMB, EMER). The hospitalization details you actually want (admit source, discharge disposition) live in hospitalization as CodeableConcepts, and the codes you get back will vary by EHR vendor. Build a reference table mapping vendor-specific codes to your standardized categories early, or you'll end up with 40 variations of "routine discharge."
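The class normalization is a small CASE expression over the ActCode values above; the discharge-disposition mapping is better kept in a dbt seed you can review with clinicians. Table name hypothetical:

```sql
SELECT
  src:id::string         AS encounter_id,
  src:class:code::string AS class_code,   -- Encounter.class is a single Coding in R4
  CASE src:class:code::string
    WHEN 'IMP'  THEN 'inpatient'
    WHEN 'AMB'  THEN 'outpatient'
    WHEN 'EMER' THEN 'emergency'
    ELSE 'other'
  END                    AS encounter_class,
  src:hospitalization:dischargeDisposition:coding[0]:code::string
                         AS discharge_disposition_code  -- join to a seed mapping table
FROM raw_fhir_encounter
```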
The Pitfalls Nobody Warns You About
The FHIR spec tells you what's possible. Your EHR vendor tells you what's actual. These are not the same document.
A few specific traps that have cost teams real time:
- Date parsing. FHIR dates are strings, not timestamps, and the spec allows partial dates: `2026`, `2026-02`, and `2026-02-23` are all valid `date` values. `TRY_TO_DATE` in Snowflake will silently return NULL for the partial formats. Pad them first: `CASE WHEN LENGTH(birthDate) = 4 THEN birthDate || '-01-01' ...` before casting.
- Reference formats. Patient references might come as `Patient/12345` (relative), `https://fhir.epic.com/api/FHIR/R4/Patient/12345` (absolute), or `urn:uuid:a7d43b2c-...` (logical). Write a normalization function that strips everything except the resource ID, and apply it consistently.
- Empty CodeableConcepts. Some vendors populate `text` but leave the `coding` array empty. If your pipeline requires a LOINC code and the EHR didn't map it, you'll get silent nulls. Build completeness checks into your dbt tests from day one.
- The Observation status trap. FHIR Observations have a `status` field: `registered`, `preliminary`, `final`, `amended`, `corrected`, `cancelled`, `entered-in-error`. Most analytics use cases only want `final` and `amended`, but if you don't filter explicitly you'll be including preliminary values and retracted results in your cohorts.
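Sketches for three of these traps, with hypothetical table names:

```sql
-- 1) Pad partial FHIR dates before casting, so TRY_TO_DATE doesn't null them.
SELECT TRY_TO_DATE(
         CASE LENGTH(src:birthDate::string)
           WHEN 4 THEN src:birthDate::string || '-01-01'  -- '2026'    -> '2026-01-01'
           WHEN 7 THEN src:birthDate::string || '-01'     -- '2026-02' -> '2026-02-01'
           ELSE src:birthDate::string
         END
       ) AS birth_date
FROM raw_fhir_patient;

-- 2) Normalize all three reference formats down to a bare resource id:
--    everything after the last '/' or ':' is the id in all three cases.
SELECT REGEXP_SUBSTR(src:subject:reference::string, '[^/:]+$') AS patient_id
FROM raw_fhir_observation;

-- 3) Filter Observation status explicitly rather than by omission.
SELECT *
FROM raw_fhir_observation
WHERE src:status::string IN ('final', 'amended');
```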
The Complete Architecture
Putting it all together, the architecture that works looks like this:
- Source: EHR FHIR R4 API (Epic, Oracle Health, or your FHIR server of choice)
- Extract: Daily `$export` job with `_since` for incremental → NDJSON files in S3
- Load: Snowflake `COPY INTO` → `raw.fhir.patient`, `raw.fhir.observation`, etc., each with a single `VARIANT` column plus `_loaded_at` and `_source_file` metadata columns
- Stage: dbt staging models extract scalar fields, preserve the raw VARIANT for downstream use
- Intermediate: dbt intermediate models handle arrays, CodeableConcepts, extensions, reference normalization
- Mart: dbt mart models produce analytics-ready wide tables — `dim_patient`, `fct_encounter`, `fct_observation_vitals`, `fct_observation_lab`, `fct_condition`, `fct_medication_request`
- Serve: BI tools (Tableau, Looker, Power BI) or ad-hoc SQL against mart tables
The raw layer is immutable — you never modify it. Every transformation lives in dbt, is version-controlled, is tested, and can be re-run from scratch if you need to re-derive a field you didn't extract the first time. The VARIANT column is your insurance policy.
One More Thing: Schema Evolution
One of the underappreciated advantages of this architecture is how gracefully it handles change. When your EHR vendor upgrades to a new version of the US Core Implementation Guide and starts populating fields that didn't exist before, your raw layer absorbs it automatically — VARIANT doesn't care. Your staging and intermediate models just keep running, ignoring the new fields until you're ready to extract them. Contrast that with a column-mapped staging table where an unexpected new field in the source JSON breaks your load entirely.
This also means you can ship incrementally. Start with Patient, Encounter, and Observation. Get those marts to a state your clinical analytics team can use. Then add Condition, MedicationRequest, and DiagnosticReport in the next sprint. The raw layer is already there; you're just writing new dbt models.
Where AI Changes This
The architecture I've described is well-understood by people who've done it before. The problem is that most healthcare data teams haven't done it before, which means they rediscover every one of these pitfalls from scratch. That's changing.
We're starting to use AI agents to automate parts of this pipeline that were previously manual and error-prone. Specifically: given a FHIR StructureDefinition or Implementation Guide profile, agents can generate the dbt staging and intermediate models automatically — correctly handling the polymorphic fields, the array flattening, the CodeableConcept normalization. It's not perfect yet, but for the 80% of resources that follow predictable patterns, it's faster than writing it by hand and less error-prone than having a developer who's new to FHIR write it from memory.
We're also using agents to monitor data quality at the FHIR layer — detecting when a vendor upgrade changes the structure of an extension, when a code system that was present in staging disappears, when completeness rates for a particular field drop unexpectedly. These are the kinds of quiet failures that previously went unnoticed for weeks.
The architecture is the same. The humans doing the work are fewer, and the ones who remain are working on things that actually require judgment.
Ready to Actually Use Your FHIR Data?
We design and build FHIR analytics platforms — from bulk export architecture to analytics-ready Snowflake marts — for healthcare organizations that need more than compliance checkboxes.
Let's Talk