Back to Blog

Snowflake-Native Payor MDM: Master Data Management Without the Legacy Baggage

By Victor Wilson

Every health plan has the same dirty secret: nobody trusts the payor data. The CRM says "Aetna." Claims says "AETNA INC." Credentialing says "Aetna Health Inc. (CT)." CMS reference data says "Aetna Life Insurance Company." They're all the same payor. They all have different addresses. At least one has a stale tax ID. And somewhere in your warehouse, they exist as four separate entities — each feeding a different report that tells a different story.

This is the payor master data problem, and it's so pervasive in healthcare that most organizations have stopped trying to fix it. Not because they don't care. Because the traditional tools for fixing it — enterprise MDM platforms from Informatica, IBM, Stibo, Reltio — are so expensive, complex, and slow to implement that the cure feels worse than the disease.

We took a different approach. We built a payor MDM system that runs entirely inside Snowflake — dbt models, Snowpark Python UDFs, a React stewardship console, and a multi-strategy matching engine that doesn't require a PhD to understand or a six-figure license to run. It collapsed 149 messy source records into 46 golden payor records in a single session. And it deploys in days, not quarters.

Here's how it works, and why it matters.

The Payor Data Problem Nobody Talks About

Provider MDM gets all the attention. Patient matching has an entire industry built around it. But payor data? It sits in a weird blind spot — everyone knows it's messy, nobody prioritizes cleaning it up, and the downstream effects are insidious.

When your payor master is fragmented, you get:

The standard response is "we'll fix it in the next EHR upgrade" or "that's a data governance initiative for next year." Translation: it never gets fixed.

Why Legacy MDM Doesn't Fit

The enterprise MDM market exists to solve this class of problem. So why aren't health plans using it for payor data?

Three reasons:

The result: payor master data problems persist because the available tools are disproportionate to the problem. What's needed isn't a lighter version of enterprise MDM. It's a fundamentally different architecture.

What Snowflake-Native MDM Looks Like

Our system runs entirely within Snowflake. No external MDM engine. No data leaving the platform. No additional infrastructure. The architecture has four layers:

Source Integration

We ingest payor records from four source systems — CRM, Claims, Credentialing, and CMS Reference — each landing in staging models via dbt. The staging layer handles the first round of normalization: standardizing field names, cleaning whitespace, expanding common abbreviations (BCBSBlue Cross Blue Shield, UHCUnited Healthcare), and applying basic type casting. Every source record retains its provenance — which system it came from, when it was last updated, what the original values were.

Multi-Strategy Matching Engine

This is the core of the system, and where it diverges most from legacy approaches. Instead of running a single matching algorithm and trusting the score, we run an ensemble of strategies and let the data characteristics determine which one wins for each field and record pair.

The matching strategies:

The ensemble layer evaluates all strategies for each candidate pair and selects the best match based on data completeness and field characteristics. A pair where both records have tax IDs gets resolved on exact match. A pair with name variants and partial addresses gets resolved on Jaro-Winkler + Jaccard. The system is transparent about which strategy produced each match and what the component scores were — no black-box ML scoring that you can't explain to an auditor.

The entire matching engine runs as Snowpark Python UDFs, executing inside Snowflake's compute layer. Your data never leaves the platform.

Survivorship Rules Engine

Once matches are identified, the system needs to decide which field values survive into the golden record. "Aetna" from CRM, "AETNA INC" from Claims, "Aetna Health Inc. (CT)" from Credentialing — which one becomes the master name?

We built a field-level survivorship configuration table that supports three resolution strategies:

The critical design decision: this configuration lives in a SQL table, not in code. A data steward can change survivorship rules — promote CMS as the authority for a new field, switch from most-recent to source-priority for addresses — without a code deployment. Edit the config table, rerun the dbt model, done.

Hierarchy Management

Payors aren't flat. There are parent organizations, plan-level entities, and subsidiaries. "UnitedHealth Group" → "United Healthcare" → "UHC of the Midwest" is a three-level hierarchy that matters for contract analysis, network adequacy reporting, and regulatory filings.

The system infers hierarchical relationships from name patterns, shared tax IDs, and address clustering, then surfaces them for steward confirmation. Inferred hierarchies get a confidence score and a "proposed" status. A steward reviews, confirms or adjusts, and the relationship becomes authoritative. This hybrid approach — algorithmic inference plus human confirmation — gives you speed without sacrificing accuracy on the relationships that matter most.

The entire pipeline — source integration, matching, survivorship, hierarchy inference — runs as dbt models and Snowpark UDFs inside Snowflake. No external infrastructure. No data movement. No additional licenses. If you have Snowflake, you have everything you need.

The Stewardship Layer: Humans in the Loop

Automated matching gets you 80% of the way. The last 20% — the ambiguous matches, the edge cases, the business context that no algorithm captures — requires human judgment. Most MDM implementations treat stewardship as an afterthought: a clunky admin console buried in the MDM platform that only a specialist can navigate.

We built a six-page React stewardship console designed for data teams and operations staff, not MDM specialists:

The console talks to Snowflake via the SQL API — no middleware, no application server. Steward decisions write back to Snowflake tables that feed the next dbt run. The loop is tight: review a match, confirm it, rerun the model, see the updated golden record. Minutes, not days.

Results: 149 Records to 46 Golden Payors

In our initial deployment, the system ingested 149 payor records from four source systems. After automated matching with ensemble strategy selection: 46 golden payor records. A 69% deduplication rate — and every merge is auditable, reversible, and explainable.

The breakdown:

Total steward review time for the 41 ambiguous pairs: under two hours. Try that with Informatica.

A Deployable Accelerator, Not a Product

This isn't a SaaS platform with a login page and a pricing tier. It's an architecture pattern and a deployable accelerator — dbt models, Snowpark UDFs, survivorship configuration, and a stewardship console that we stand up inside your Snowflake environment. Your data stays in your account. Your compute runs on your warehouse. Your team owns the code.

Deployment timeline: days, not quarters. The matching strategies are configurable. The survivorship rules are table-driven. The stewardship console connects to your Snowflake instance. What changes between deployments is configuration, not architecture — which sources feed in, what the survivorship priorities are, how the hierarchy maps to your organizational reality.

For data consultancies and lean internal teams, this is the model that works: an opinionated accelerator that handles the hard engineering (fuzzy matching, ensemble strategy selection, survivorship logic) while leaving the domain decisions (source priority, hierarchy structure, match thresholds) to the people who know the data.

Your payor data is a mess. It's been a mess for years. The tools to fix it have existed for decades — they've just been too expensive, too complex, and too slow. They don't have to be.

Ready to Master Your Payor Data?

CV Health deploys Snowflake-native MDM accelerators that clean up payor master data in days — no new infrastructure, no six-figure licenses, no year-long implementations.

Let's Talk