Your Data Is Dirty. Let's Fix That.

Uri Bushey

VP, Product

Every company has dirty data. That's not a criticism, it's just what happens when humans interact with systems. Teams type things in by hand. Forms don't enforce standards. Data gets imported from three different tools with three different naming conventions. Over time, the same thing gets recorded a dozen different ways.

And your models, dashboards, and AI agents are only as good as what goes in.

The problem isn't the mess, it's what you do about it

To make it concrete, imagine a vet clinic where staff type in dog breeds by hand. You end up with "Dashound," "Dachs Hund," and "Wiener Dog": all meaning the same thing, all treated as different values by your database.

Most teams pick one of two fixes.

The fix teams usually reach for first is rules: if the value contains "wiener" or "dach," map it to Dachshund. Rules work well for structured data. For real-world data (typed in by real people, scraped from the web, imported from legacy systems), they become brittle fast. One unexpected spelling and the rule misses it: "Dashound" contains neither "wiener" nor "dach," so it slips straight through.
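Here is a minimal sketch of that substring rule, to show where it breaks. The function name rule_based_breed and the sample values are hypothetical, chosen only to mirror the vet clinic example.

```python
from typing import Optional


def rule_based_breed(raw: str) -> Optional[str]:
    """Hypothetical rule: anything containing 'wiener' or 'dach' is a Dachshund."""
    value = raw.strip().lower()
    if "wiener" in value or "dach" in value:
        return "Dachshund"
    return None  # the rule has no answer for this value


samples = ["Dachs Hund", "Wiener Dog", "Dashound"]
results = {s: rule_based_breed(s) for s in samples}

for raw, canonical in results.items():
    print(f"{raw!r} -> {canonical}")
```

The first two values match. "Dashound" comes back as None, because the misspelling drops the "c" the rule was looking for. You can patch the rule, but the next typo will find another gap.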

The second fix is manual cleanup. Someone opens a spreadsheet and fixes rows one by one. That works once. Tomorrow there's new data, and you're right back where you started.

Neither scales. Neither sticks.

Normalization as infrastructure

At Narrative, we think about this differently. Dirty data isn't a cleanup problem, it's a normalization problem. You have messy input on one side and a definition of "correct" on the other. The challenge is connecting the two reliably, automatically, and at scale.

That's what the Rosetta Stone Normalization Engine does.

You start by defining what clean looks like: a canonical attribute, your single source of truth for what a valid value is. It could be a global taxonomy Narrative maintains, or your own custom definition built around your business logic. Either way, it's explicit: one spelling, one casing, no ambiguity.

Then Rosetta Stone does the matching. It works in layers:

Hard-coded mappings first.

Then a classical ML classifier for synonyms and alternate spellings.

And finally, an LLM for only the truly novel or ambiguous rows.

A smart cache underneath means you never pay to resolve the same value twice. Most teams trying to solve this today point an LLM at the whole column and burn tokens on rows a simple rule could have handled. This approach only reaches for the expensive tool when it actually needs it.
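The layering and the cache can be sketched in a few lines. This is an illustration of the general pattern under stated assumptions, not Rosetta Stone's actual implementation: the canonical list, the hard-coded table, and the fake_llm stub are all made up for the example, and standard-library fuzzy matching stands in for the ML classifier layer.

```python
import difflib
from typing import Callable, Dict, Optional

# Assumed canonical values and exact mappings, for illustration only.
CANONICAL = ["Dachshund", "Beagle", "Poodle"]
HARD_CODED = {
    "wiener dog": "Dachshund",
    "\U0001F32D": "Dachshund",  # hot-dog emoji
}


def fuzzy_match(value: str) -> Optional[str]:
    """Stand-in for the classifier layer: stdlib fuzzy string matching."""
    lowered = {c.lower(): c for c in CANONICAL}
    hits = difflib.get_close_matches(value, list(lowered), n=1, cutoff=0.7)
    return lowered[hits[0]] if hits else None


class LayeredNormalizer:
    def __init__(self, llm_fallback: Callable[[str], str]) -> None:
        self.llm_fallback = llm_fallback
        self.cache: Dict[str, str] = {}
        self.llm_calls = 0  # counts how often the expensive layer runs

    def resolve(self, raw: str) -> str:
        key = raw.strip().lower()
        if key in self.cache:              # already resolved: costs nothing
            return self.cache[key]
        result = HARD_CODED.get(key)       # layer 1: exact mappings
        if result is None:
            result = fuzzy_match(key)      # layer 2: cheap matching
        if result is None:
            self.llm_calls += 1            # layer 3: only for leftovers
            result = self.llm_fallback(raw)
        self.cache[key] = result
        return result


def fake_llm(raw: str) -> str:
    """Placeholder for a real model call; a tiny lookup keeps the sketch runnable."""
    return {"sausage dog": "Dachshund"}.get(raw.strip().lower(), "Unknown")


normalizer = LayeredNormalizer(fake_llm)
inputs = ["Wiener Dog", "Dashound", "Dachs Hund", "Sausage Dog", "Sausage Dog"]
outputs = [normalizer.resolve(v) for v in inputs]
print(outputs, "LLM calls:", normalizer.llm_calls)
```

In this run, only "Sausage Dog" reaches the fallback, and only once: the repeat is served from the cache. The cheaper layers handle everything else, which is the cost argument in miniature.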

The result: "Dashound," "Dachs Hund," "Wiener Dog," and yes, even a hot-dog emoji all resolve to the same canonical value automatically, continuously, as new data arrives.

The pattern shows up everywhere

We demonstrated this with dog breeds because the problem is immediately visible. But the same pattern runs through almost every industry: job titles in your CRM, product names across a merged catalog, diagnosis codes in healthcare, counterparty names in finance.

The column changes and the stakes change, but the problem is identical.

Dirty data is the reason your data team spends more time cleaning than analyzing. It's the reason your AI is unreliable. It's the reason your dashboards are wrong.

The Rosetta Stone Normalization Engine is the layer that fixes it. Not once, not manually, but continuously, at the point where data enters your system.
