Best Practices for Normalizing Emails Before Hashing

To achieve reliable and consistent results when hashing email addresses, normalization is crucial. Without it, slight variations in email formatting can lead to mismatched hashes, affecting tasks like identity matching and deduplication. Here are four essential practices to follow when normalizing emails for hashing:

1. Convert to Lowercase

  • Why: Domain names are case-insensitive, and although the local part can technically be case-sensitive, virtually all providers ignore case there too. Variations like User@Domain.com and user@domain.com should therefore be treated as identical.
  • How to Normalize: Convert the entire email address to lowercase.
  • Example: User@Domain.com becomes user@domain.com.
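
As a minimal sketch of this step in Python (the function name lowercase_email is a hypothetical helper, not part of any library):

```python
def lowercase_email(email: str) -> str:
    """Return the address with every character lowercased."""
    # str.lower() handles both the local part and the domain in one pass.
    return email.lower()


print(lowercase_email("User@Domain.com"))  # user@domain.com
```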

2. Trim Whitespace

  • Why: Users often add spaces at the beginning or end of their email by mistake, which can result in hashing inconsistencies.
  • How to Normalize: Remove all leading and trailing whitespace from the email.
  • Example: " user@domain.com " becomes user@domain.com.
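
One way this could look in Python, assuming the raw value arrives as a string from a form or file (trim_email is a hypothetical helper name):

```python
def trim_email(email: str) -> str:
    """Remove leading and trailing whitespace from the address."""
    # With no arguments, str.strip() removes spaces, tabs, and newlines
    # from both ends and leaves interior characters untouched.
    return email.strip()


print(trim_email(" user@domain.com "))  # user@domain.com
```

Trimming only affects the ends of the string; whitespace inside an address usually signals invalid input and is better rejected than silently repaired.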

3. Remove Periods in the Local Part for Gmail Addresses

  • Why: Gmail ignores periods in the local part of email addresses, treating username@gmail.com and user.name@gmail.com as identical addresses.
  • How to Normalize: For Gmail addresses (gmail.com and googlemail.com), remove all periods from the local part of the address.
  • Example: user.name@gmail.com becomes username@gmail.com.
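
The rule above might be sketched in Python as follows. The names GMAIL_DOMAINS and remove_gmail_dots are illustrative, and the sketch leaves addresses without an "@" untouched rather than validating them:

```python
GMAIL_DOMAINS = {"gmail.com", "googlemail.com"}


def remove_gmail_dots(email: str) -> str:
    """Strip periods from the local part, but only for Gmail domains."""
    # rpartition splits on the last "@", so the domain is always the tail.
    local, sep, domain = email.rpartition("@")
    if not sep:
        return email  # no "@" found: leave malformed input unchanged
    if domain.lower() in GMAIL_DOMAINS:
        local = local.replace(".", "")
    return f"{local}@{domain}"


print(remove_gmail_dots("user.name@gmail.com"))    # username@gmail.com
print(remove_gmail_dots("user.name@example.com"))  # user.name@example.com
```

Restricting the rule to a known set of domains matters because most other providers treat periods in the local part as significant.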

4. Handle Aliases Properly

  • Why: Many providers, particularly Gmail, let users append a “+” suffix to the local part to create aliases (e.g., user+alias@gmail.com), which are often used to filter or categorize emails. All aliases deliver to the same mailbox, so normalizing them prevents redundant entries in hashed data.
  • How to Normalize: For providers that support this feature, remove the + symbol and everything after it in the local part.
  • Example: user+promo@gmail.com becomes user@gmail.com.
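
A possible Python sketch of alias stripping is shown below. The set PLUS_ALIAS_DOMAINS is an assumption for illustration: it lists only the Gmail domains named in this article, and any other provider would need to be confirmed and added by you.

```python
# Assumption: only these domains are known to support "+" aliases here.
PLUS_ALIAS_DOMAINS = {"gmail.com", "googlemail.com"}


def strip_plus_alias(email: str, alias_domains=PLUS_ALIAS_DOMAINS) -> str:
    """Drop the "+" and everything after it in the local part."""
    local, sep, domain = email.rpartition("@")
    if not sep:
        return email  # no "@" found: leave malformed input unchanged
    if domain.lower() in alias_domains:
        # Keep only the text before the first "+".
        local = local.split("+", 1)[0]
    return f"{local}@{domain}"


print(strip_plus_alias("user+promo@gmail.com"))    # user@gmail.com
print(strip_plus_alias("user+promo@example.com"))  # user+promo@example.com
```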

Example Data: Normalization and Hashing

Below is example data showing raw emails, their normalized version, and the resulting SHA-256 hash of each normalized email:

| Original Email | Normalized Email | SHA-256 Hash |
| --- | --- | --- |
| User.Name+promo@gmail.com | username@gmail.com | f7976af5cad49504523379918222f1da1c1e231cd5d0770603c3e71e5e30e030 |
| admin@Example.com | admin@example.com | 258d8dc916db8cea2cafb6c3cd0cb0246efe061421dbd83ec3a350428cabda4f |
| JOHNDOE@Yahoo.COM | johndoe@yahoo.com | 0be12621ac1b083b13be10c0e16d3c0de3d72005056c4cfe72d218ce824f2977 |
| user.name+news@gmail.com | username@gmail.com | f7976af5cad49504523379918222f1da1c1e231cd5d0770603c3e71e5e30e030 |
| alice@domain.com | alice@domain.com | 4931ae9f911d21f9ba59dda70af400c9be1a06bedaeb60ad372e336c69d70f5b |

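Each normalized email is then hashed using SHA-256. Because the hash is computed on the normalized form, variants of the same address (such as the two Gmail rows above) produce an identical hash, which allows consistent identification across datasets. Following these steps minimizes duplicates and improves match accuracy.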
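
Putting the four practices together, a minimal end-to-end sketch in Python could look like this. It uses the standard-library hashlib module; the function names are hypothetical, and applying the period and alias rules only to Gmail domains is an assumption you may need to widen for other providers.

```python
import hashlib

GMAIL_DOMAINS = {"gmail.com", "googlemail.com"}


def normalize_email(email: str) -> str:
    """Trim, lowercase, and apply the Gmail-specific rules."""
    email = email.strip().lower()
    local, sep, domain = email.rpartition("@")
    if not sep:
        return email  # no "@" found: return the trimmed, lowercased input
    if domain in GMAIL_DOMAINS:
        # Drop the "+alias" suffix first, then remove periods.
        local = local.split("+", 1)[0].replace(".", "")
    return f"{local}@{domain}"


def hash_email(email: str) -> str:
    """Return the hex SHA-256 digest of the normalized address."""
    normalized = normalize_email(email)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


for raw in ["User.Name+promo@gmail.com", "user.name+news@gmail.com"]:
    print(raw, "->", normalize_email(raw), hash_email(raw))
```

Both inputs in the loop normalize to username@gmail.com, so they print the same digest. Encoding to UTF-8 before hashing is a deliberate choice: every system that needs to match hashes must agree on the same byte encoding.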
