What is distance matching?

Question

Florian Poullin · Accepted Answer

Distance matching compares two text values and returns a similarity score, usually expressed between 0 (nothing in common) and 1 (identical).

It is often used for fuzzy deduplication, where values are not identical but may still refer to the same person, company, product, or address.

Common distance algorithms

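Two common algorithms are:

Levenshtein counts the minimum number of single-character edits (insertions, deletions, or substitutions) needed to transform one string into another.

Jaro-Winkler scores shared characters and transpositions, and gives extra weight to a shared prefix, which often makes it useful for shorter values such as names.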
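To make the difference between the two algorithms concrete, here is a minimal pure-Python sketch of both. It is an illustration only, not Datablist's implementation: the function names are made up for this example, the Levenshtein similarity is normalized by the longer string's length (one common convention among several), and the Jaro-Winkler variant assumes the usual defaults of a 4-character prefix cap and a 0.1 scaling factor.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete a character from a
                current[j - 1] + 1,            # insert a character into a
                previous[j - 1] + (ca != cb),  # substitute (free when equal)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaro(a: str, b: str) -> float:
    """Jaro similarity based on matching characters and transpositions."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Characters match only if they are equal and within this window.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    matched_a = [False] * len(a)
    matched_b = [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not matched_b[j] and b[j] == ch:
                matched_a[i] = matched_b[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len(a)):
        if matched_a[i]:
            while not matched_b[k]:
                k += 1
            if a[i] != b[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a: str, b: str, scaling: float = 0.1) -> float:
    """Jaro score boosted by the length of the shared prefix (capped at 4)."""
    score = jaro(a, b)
    prefix = 0
    for ca, cb in zip(a[:4], b[:4]):
        if ca != cb:
            break
        prefix += 1
    return score + prefix * scaling * (1.0 - score)


print(levenshtein("kitten", "sitting"))                       # 3
print(round(levenshtein_similarity("kitten", "sitting"), 3))  # 0.571
print(round(jaro("MARTHA", "MARHTA"), 3))                     # 0.944
print(round(jaro_winkler("MARTHA", "MARHTA"), 3))             # 0.961
```

The last two lines show the prefix effect: both strings start with "MAR", so Jaro-Winkler lifts the plain Jaro score.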

What is a matching threshold?

A threshold defines how similar two values must be to count as a match.

For example, with a threshold of 0.80, two values count as a match only when their similarity score is at least 0.80 (80%).

Lower thresholds find more possible duplicates but increase false positives. Higher thresholds are stricter but can miss messy duplicates.
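The trade-off can be sketched in a few lines of Python. This example uses the standard library's `difflib.SequenceMatcher` ratio as a stand-in similarity score (it is neither Levenshtein nor Jaro-Winkler, and it is not what Datablist computes); the sample names and the `candidate_pairs` helper are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Stand-in similarity score in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_pairs(values, threshold):
    """Return every pair whose similarity reaches the threshold."""
    return [
        (a, b)
        for a, b in combinations(values, 2)
        if similarity(a, b) >= threshold
    ]


names = ["Jon Smith", "John Smith", "Jonathan Smith", "Maria Lopez"]

strict = candidate_pairs(names, 0.90)  # fewer pairs, fewer false positives
loose = candidate_pairs(names, 0.75)   # more pairs, more manual review

print("strict:", strict)
print("loose: ", loose)
```

With these sample values, the stricter threshold keeps only the closest spelling variants, while the looser one also pulls in "Jonathan Smith", which may or may not be the same person. Every pair found at the stricter threshold is also found at the looser one.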

📌 Practical rule

Start with a stricter threshold. Lower it only when you need broader candidate groups for manual review.

Datablist workflow

Datablist supports distance matching in the Duplicates Remover, with Levenshtein and Jaro-Winkler options.

Distance matching is available on paid plans. Select the algorithm and set its similarity threshold for the chosen text property.

Use distance matching for text columns such as names, company names, addresses, or product titles. Use exact or smart matching for identifiers such as emails, URLs, and domains.
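A short Python sketch shows why identifiers are a poor fit for distance matching. It again uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity score, and the exact comparison applies a simple trim-and-lowercase normalization chosen for this example only; it does not describe how Datablist's exact or smart matching works. All names and sample values are hypothetical.

```python
from difflib import SequenceMatcher


def exact_match(a: str, b: str) -> bool:
    """Exact comparison after trimming whitespace and lowercasing (assumed normalization)."""
    return a.strip().lower() == b.strip().lower()


def fuzzy_score(a: str, b: str) -> float:
    """Stand-in similarity score in [0, 1]."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


# Two different mailboxes that differ by a single character.
email_a = "john.smith@acme.com"
email_b = "jon.smith@acme.com"

print(exact_match(email_a, email_b))            # False: different identifiers
print(round(fuzzy_score(email_a, email_b), 2))  # high score despite being different

# The same mailbox with different casing and stray spaces.
print(exact_match(" John.Smith@Acme.com ", email_a))  # True
```

A distance score would flag the two addresses as near-duplicates even though they are distinct identifiers, which is why exact-style comparison is the safer choice for emails, URLs, and domains.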

For name matching, compare distance matching with phonetic matching. The main deduplication guide includes a tested Levenshtein workflow and examples of name-only false positives.