Every health plan has the same dirty secret: nobody trusts the payor data. The CRM says "Aetna." Claims says "AETNA INC." Credentialing says "Aetna Health Inc. (CT)." CMS reference data says "Aetna Life Insurance Company." They're all the same payor. They all have different addresses. At least one has a stale tax ID. And somewhere in your warehouse, they exist as four separate entities — each feeding a different report that tells a different story.
This is the payor master data problem, and it's so pervasive in healthcare that most organizations have stopped trying to fix it. Not because they don't care. Because the traditional tools for fixing it — enterprise MDM platforms from Informatica, IBM, Stibo, Reltio — are so expensive, complex, and slow to implement that the cure feels worse than the disease.
We took a different approach. We built a payor MDM system that runs entirely inside Snowflake — dbt models, Snowpark Python UDFs, a React stewardship console, and a multi-strategy matching engine that doesn't require a PhD to understand or a six-figure license to run. It collapsed 149 messy source records into 46 golden payor records in a single session. And it deploys in days, not quarters.
Here's how it works, and why it matters.
The Payor Data Problem Nobody Talks About
Provider MDM gets all the attention. Patient matching has an entire industry built around it. But payor data? It sits in a weird blind spot — everyone knows it's messy, nobody prioritizes cleaning it up, and the downstream effects are insidious.
When your payor master is fragmented, you get:
- Inaccurate contract performance reporting. If the same payor exists as three entities in your warehouse, your contract analytics are splitting volume across phantom payors. Negotiators walk into renewal conversations with incomplete data.
- Credentialing gaps. Provider credentialing depends on accurate payor identification. When the credentialing system's version of "United Healthcare" doesn't match claims, enrollment verification fails or routes to manual review.
- Compliance exposure. CMS reporting requires accurate payor identification. If your internal payor master doesn't reconcile cleanly with CMS reference data, you're either over-reporting, under-reporting, or spending staff hours on manual reconciliation that should be automated.
- Analytics that lie to you. Payor mix analysis, denial rate trending, revenue cycle benchmarking — all of it depends on knowing that "BCBS of Illinois," "Blue Cross Blue Shield IL," and "BCBSIL" are the same entity. If they're not linked, your dashboards are fiction.
The standard response is "we'll fix it in the next EHR upgrade" or "that's a data governance initiative for next year." Translation: it never gets fixed.
Why Legacy MDM Doesn't Fit
The enterprise MDM market exists to solve this class of problem. So why aren't health plans using it for payor data?
Three reasons:
- Cost. Informatica MDM, IBM InfoSphere, Reltio, Stibo STEP — these platforms start at six figures annually and scale from there. For a full payor MDM implementation, you're looking at license costs plus 6–12 months of professional services. That's a defensible investment for a 50-hospital system mastering patient data. It's a non-starter for cleaning up a few hundred payor records.
- Infrastructure overhead. Legacy MDM tools require their own infrastructure — separate servers, their own databases, ETL pipelines to move data in and out, and specialized administrators to keep it running. Your data leaves Snowflake, gets processed somewhere else, and comes back. Every hop is a latency risk, a security surface, and an integration to maintain.
- Complexity mismatch. Enterprise MDM platforms are designed for millions of records with complex hierarchies — global patient matching, product information management, multi-geography supplier networks. Payor MDM is a focused problem: hundreds to low thousands of entities, relatively stable, with well-understood attributes. Using Informatica for this is like renting a crane to hang a picture frame.
The result: payor master data problems persist because the available tools are disproportionate to the problem. What's needed isn't a lighter version of enterprise MDM. It's a fundamentally different architecture.
What Snowflake-Native MDM Looks Like
Our system runs entirely within Snowflake. No external MDM engine. No data leaving the platform. No additional infrastructure. The architecture has four layers:
Source Integration
We ingest payor records from four source systems — CRM, Claims, Credentialing, and CMS Reference — each landing in staging models via dbt. The staging layer handles the first round of normalization: standardizing field names, cleaning whitespace, expanding common abbreviations (BCBS → Blue Cross Blue Shield, UHC → United Healthcare), and applying basic type casting. Every source record retains its provenance — which system it came from, when it was last updated, what the original values were.
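To make the normalization step concrete, here is a minimal pure-Python sketch of the kind of cleanup the staging layer applies. The abbreviation map is illustrative — in the accelerator the lookup is a configurable table, not hardcoded — and `normalize_payor_name` is a hypothetical helper name, not the actual dbt macro.

```python
import re

# Illustrative abbreviation map; the real lookup lives in a config table.
ABBREVIATIONS = {
    "BCBS": "BLUE CROSS BLUE SHIELD",
    "UHC": "UNITED HEALTHCARE",
}

def normalize_payor_name(raw):
    """Uppercase, collapse whitespace, and expand known abbreviations."""
    name = re.sub(r"\s+", " ", raw.strip().upper())
    tokens = [ABBREVIATIONS.get(t, t) for t in name.split()]
    return " ".join(tokens)
```

Running every source name through the same normalizer before matching means the fuzzy algorithms compare like with like — "bcbs of Illinois" and "Blue Cross Blue Shield of Illinois" become identical strings before a single similarity score is computed.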
Multi-Strategy Matching Engine
This is the core of the system, and where it diverges most from legacy approaches. Instead of running a single matching algorithm and trusting the score, we run an ensemble of strategies and let the data characteristics determine which one wins for each field and record pair.
The matching strategies:
- Jaro-Winkler similarity for organization names — handles transpositions and prefix commonality well, which matters when "Blue Cross Blue Shield of Illinois" meets "Blue Cross and Blue Shield of IL"
- Jaccard token overlap for addresses — breaks addresses into tokens and compares set overlap, robust against word order differences ("123 Main St Suite 200" vs "Suite 200, 123 Main Street")
- Phonetic matching for name variants — catches cases where abbreviations or misspellings produce different strings but identical pronunciation
- Abbreviation expansion — a configurable lookup that normalizes industry-standard abbreviations before other algorithms run, dramatically reducing false negatives
- Exact matching on tax IDs and NPI numbers — when authoritative identifiers exist and are populated, they override fuzzy logic entirely
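The two fuzzy strategies above can be sketched in plain Python. These are minimal reference implementations for illustration — the production versions run as Snowpark UDFs inside Snowflake, and parameter choices (prefix weight, token splitting) are assumptions, not the accelerator's tuned values.

```python
def jaro(s1, s2):
    """Standard Jaro similarity: matched characters within a sliding window."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    m1, m2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(i + window + 1, len2)):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    transpositions, k = 0, 0
    for i in range(len1):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    t = transpositions // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Jaro boosted by common-prefix length -- rewards shared leading text."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

def jaccard_tokens(a, b):
    """Token-set overlap for addresses: robust to word-order differences."""
    ta, tb = set(a.upper().split()), set(b.upper().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)
```

Note how `jaccard_tokens("123 Main St Suite 200", "Suite 200 123 Main St")` scores a perfect 1.0 despite the reordering — exactly the property that makes token overlap the right tool for addresses and character similarity the right tool for names.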
The ensemble layer evaluates all strategies for each candidate pair and selects the best match based on data completeness and field characteristics. A pair where both records have tax IDs gets resolved on exact match. A pair with name variants and partial addresses gets resolved on Jaro-Winkler + Jaccard. The system is transparent about which strategy produced each match and what the component scores were — no black-box ML scoring that you can't explain to an auditor.
The entire matching engine runs as Snowpark Python UDFs, executing inside Snowflake's compute layer. Your data never leaves the platform.
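The dispatch logic described above might look like the following sketch. Field names (`tax_id`, `name`, `address`), thresholds, and blend weights are illustrative assumptions, and `difflib.SequenceMatcher` stands in for the Jaro-Winkler and Jaccard scorers — the point is the strategy-selection shape, not the scoring itself.

```python
from difflib import SequenceMatcher

def score_pair(a, b):
    """Pick a matching strategy per pair based on data completeness.

    Returns the winning strategy name, the blended score, and the
    component scores so every decision stays explainable.
    """
    # Strategy 1: authoritative identifiers override fuzzy logic entirely.
    if a.get("tax_id") and b.get("tax_id"):
        exact = a["tax_id"] == b["tax_id"]
        return {"strategy": "exact_tax_id", "score": 1.0 if exact else 0.0}
    # Strategy 2: fuzzy name similarity (stand-in for Jaro-Winkler).
    name_score = SequenceMatcher(None, a.get("name", ""), b.get("name", "")).ratio()
    components = {"name": name_score}
    # Strategy 3: blend in address overlap when both sides have one.
    if a.get("address") and b.get("address"):
        addr_score = SequenceMatcher(None, a["address"], b["address"]).ratio()
        components["address"] = addr_score
        blended = 0.7 * name_score + 0.3 * addr_score  # illustrative weights
        return {"strategy": "name_plus_address", "score": blended,
                "components": components}
    return {"strategy": "name_only", "score": name_score,
            "components": components}
```

Because the result carries both the winning strategy and its component scores, an auditor can see exactly why any two records were linked — the transparency property the ensemble design is built around.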
Survivorship Rules Engine
Once matches are identified, the system needs to decide which field values survive into the golden record. "Aetna" from CRM, "AETNA INC" from Claims, "Aetna Health Inc. (CT)" from Credentialing — which one becomes the master name?
We built a field-level survivorship configuration table that supports three resolution strategies:
- Source priority: CMS Reference wins for legal names and tax IDs. CRM wins for relationship owner and primary contact. Credentialing wins for network status.
- Most recent: For fields where currency matters — status, address, contact information — the most recently updated source value wins.
- Most complete: For fields where population matters more than recency — if CRM has a full address with suite number and Claims has just a street line, the complete record wins.
The critical design decision: this configuration lives in a SQL table, not in code. A data steward can change survivorship rules — promote CMS as the authority for a new field, switch from most-recent to source-priority for addresses — without a code deployment. Edit the config table, rerun the dbt model, done.
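The resolution logic the config table drives can be sketched as a small dispatcher. In the real system the `SURVIVORSHIP` mapping below is a SQL table, not a Python dict, and the field names and source rankings here are assumptions for illustration.

```python
from datetime import date

# Illustrative survivorship config -- the production version is a SQL table
# a steward edits directly, with no code deployment.
SURVIVORSHIP = {
    "legal_name": {"strategy": "source_priority",
                   "priority": ["CMS", "CREDENTIALING", "CLAIMS", "CRM"]},
    "address":    {"strategy": "most_recent"},
    "phone":      {"strategy": "most_complete"},
}

def resolve_field(field, candidates):
    """candidates: [{'source': ..., 'value': ..., 'updated': date}, ...]"""
    rule = SURVIVORSHIP[field]
    populated = [c for c in candidates if c["value"]]
    if not populated:
        return None
    if rule["strategy"] == "source_priority":
        rank = {s: i for i, s in enumerate(rule["priority"])}
        return min(populated, key=lambda c: rank.get(c["source"], len(rank)))["value"]
    if rule["strategy"] == "most_recent":
        return max(populated, key=lambda c: c["updated"])["value"]
    # most_complete: the longest populated value wins (a crude completeness proxy)
    return max(populated, key=lambda c: len(c["value"]))["value"]
```

Changing which source is authoritative for a field is then a one-row edit to the config, and the next dbt run rebuilds every golden record under the new rule.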
Hierarchy Management
Payors aren't flat. There are parent organizations, plan-level entities, and subsidiaries. "UnitedHealth Group" → "United Healthcare" → "UHC of the Midwest" is a three-level hierarchy that matters for contract analysis, network adequacy reporting, and regulatory filings.
The system infers hierarchical relationships from name patterns, shared tax IDs, and address clustering, then surfaces them for steward confirmation. Inferred hierarchies get a confidence score and a "proposed" status. A steward reviews, confirms or adjusts, and the relationship becomes authoritative. This hybrid approach — algorithmic inference plus human confirmation — gives you speed without sacrificing accuracy on the relationships that matter most.
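A simplified sketch of the inference step, assuming names are already normalized: the prefix heuristic and confidence weights below are illustrative stand-ins, not the production scoring, and `propose_parent` is a hypothetical helper name.

```python
def propose_parent(child, candidates):
    """Propose the most likely parent for a payor record.

    Signals: parent name as a leading substring of the child's name,
    and a shared tax ID. Returns a 'proposed' relationship for steward
    review, never an authoritative one.
    """
    best = None
    for parent in candidates:
        if parent["name"] == child["name"]:
            continue
        confidence = 0.0
        # Signal 1: "UNITED HEALTHCARE" -> "UNITED HEALTHCARE OF THE MIDWEST"
        if child["name"].startswith(parent["name"] + " "):
            confidence += 0.6
        # Signal 2: shared tax ID suggests common ownership.
        if parent.get("tax_id") and parent.get("tax_id") == child.get("tax_id"):
            confidence += 0.3
        if confidence > 0 and (best is None or confidence > best["confidence"]):
            best = {"parent": parent["name"], "confidence": confidence,
                    "status": "proposed"}
    return best  # a steward confirms before this becomes authoritative
```

The `status: "proposed"` field is the key design point: nothing the algorithm infers enters the hierarchy until a human has looked at it.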
The entire pipeline — source integration, matching, survivorship, hierarchy inference — runs as dbt models and Snowpark UDFs inside Snowflake. No external infrastructure. No data movement. No additional licenses. If you have Snowflake, you have everything you need.
The Stewardship Layer: Humans in the Loop
Automated matching gets you 80% of the way. The last 20% — the ambiguous matches, the edge cases, the business context that no algorithm captures — requires human judgment. Most MDM implementations treat stewardship as an afterthought: a clunky admin console buried in the MDM platform that only a specialist can navigate.
We built a six-page React stewardship console designed for data teams and operations staff, not MDM specialists:
- Match review workbench: Side-by-side comparison of candidate pairs with match scores, strategy breakdowns, and field-level highlights. Keyboard shortcuts — `Y` for yes, `N` for no, `S` for skip — let a steward process ambiguous matches at speed. No mouse required for the core workflow.
- Golden record viewer: The current state of every mastered payor — which sources contributed, which survivorship rules applied, what the steward overrode manually.
- Hierarchy manager: Visual parent-plan-subsidiary tree with drag-and-drop reorganization and proposed-relationship review.
- Audit trail: Every match decision, every survivorship override, every hierarchy change — timestamped, attributed, and queryable. When compliance asks "why is this the golden record for Cigna?", the answer is three clicks away.
- Data quality scorecard: Completeness, consistency, and match rate metrics across all source systems. Shows where the data is strong, where it's degrading, and which sources need attention.
The console talks to Snowflake via the SQL API — no middleware, no application server. Steward decisions write back to Snowflake tables that feed the next dbt run. The loop is tight: review a match, confirm it, rerun the model, see the updated golden record. Minutes, not days.
Results: 149 Records to 46 Golden Payors
In our initial deployment, the system ingested 149 payor records from four source systems. After automated matching with ensemble strategy selection: 46 golden payor records. A 69% deduplication rate — and every merge is auditable, reversible, and explainable.
The breakdown:
- High-confidence automated merges: 72 records resolved into 23 golden records without human intervention — exact tax ID matches, Jaro-Winkler name scores above 0.95, consistent addresses
- Steward-reviewed merges: 41 records surfaced for human review — ambiguous name matches, conflicting tax IDs, address discrepancies that could indicate separate subsidiaries rather than duplicates
- Singleton records: 36 records that represent genuinely unique payors with no match candidates
Total steward review time for the 41 ambiguous records: under two hours. Try that with Informatica.
A Deployable Accelerator, Not a Product
This isn't a SaaS platform with a login page and a pricing tier. It's an architecture pattern and a deployable accelerator — dbt models, Snowpark UDFs, survivorship configuration, and a stewardship console that we stand up inside your Snowflake environment. Your data stays in your account. Your compute runs on your warehouse. Your team owns the code.
Deployment timeline: days, not quarters. The matching strategies are configurable. The survivorship rules are table-driven. The stewardship console connects to your Snowflake instance. What changes between deployments is configuration, not architecture — which sources feed in, what the survivorship priorities are, how the hierarchy maps to your organizational reality.
For data consultancies and lean internal teams, this is the model that works: an opinionated accelerator that handles the hard engineering (fuzzy matching, ensemble strategy selection, survivorship logic) while leaving the domain decisions (source priority, hierarchy structure, match thresholds) to the people who know the data.
Your payor data is a mess. It's been a mess for years. The tools to fix it have existed for decades — they've just been too expensive, too complex, and too slow. They don't have to be.
Ready to Master Your Payor Data?
CV Health deploys Snowflake-native MDM accelerators that clean up payor master data in days — no new infrastructure, no six-figure licenses, no year-long implementations.
Let's Talk