Distance matching compares two text values and returns a similarity score.

It is often used for fuzzy deduplication, where values are not identical but may still refer to the same person, company, product, or address.

Common distance algorithms

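Two common algorithms are Levenshtein and Jaro-Winkler.

Levenshtein counts the minimum number of single-character edits (insertions, deletions, or substitutions) needed to transform one string into another. Jaro-Winkler is often useful for short strings such as names because it gives more weight to shared prefixes.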
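
To make the two algorithms concrete, here is a minimal Python sketch of both. It is an illustration written for this article, not the implementation Datablist uses. The function names, the choice to normalize Levenshtein distance by the length of the longer string, and the Jaro-Winkler prefix weight of 0.1 capped at four characters (the commonly cited defaults) are all assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (free when characters match)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 for nothing in common."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def jaro(a: str, b: str) -> float:
    """Jaro similarity based on matching characters and transpositions."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Characters match only if they are equal and close enough in position.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_matched[j] and b[j] == ca:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    a_seq = [c for c, m in zip(a, a_matched) if m]
    b_seq = [c for c, m in zip(b, b_matched) if m]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a: str, b: str, prefix_weight: float = 0.1) -> float:
    """Jaro score boosted for a shared prefix of up to four characters."""
    score = jaro(a, b)
    prefix = 0
    for ca, cb in zip(a[:4], b[:4]):
        if ca != cb:
            break
        prefix += 1
    return score + prefix * prefix_weight * (1.0 - score)


print(levenshtein("kitten", "sitting"))              # 3
print(round(jaro_winkler("MARTHA", "MARHTA"), 3))    # 0.961
```

The prefix boost is why Jaro-Winkler tends to score "MARTHA" and "MARHTA" higher than plain Jaro does: the two strings share the first three characters.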

What is a matching threshold?

A threshold defines how similar two values must be to count as a match.

For example, a threshold of 0.80 means two values count as a match only when their similarity score is at least 0.80 (80%).

Lower thresholds find more possible duplicates but increase false positives. Higher thresholds are stricter but can miss messy duplicates.
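
The trade-off can be seen in a small Python sketch that lists candidate duplicate pairs at two thresholds. To keep it self-contained, it uses difflib.SequenceMatcher from the Python standard library as a stand-in similarity score; that is neither Levenshtein nor Jaro-Winkler, and the sample values and the candidate_pairs helper are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Stand-in similarity score between 0.0 and 1.0 (not Levenshtein or Jaro-Winkler)."""
    return SequenceMatcher(None, a, b).ratio()


def candidate_pairs(values, threshold):
    """Return every pair of values whose similarity reaches the threshold."""
    return [
        (a, b)
        for a, b in combinations(values, 2)
        if similarity(a, b) >= threshold
    ]


# Hypothetical sample data for illustration only.
companies = ["Acme Corp", "Acme Corp.", "Acme Corporation", "Globex"]

strict = candidate_pairs(companies, threshold=0.90)
loose = candidate_pairs(companies, threshold=0.65)

print("strict:", strict)
print("loose: ", loose)
```

With these sample values, the strict threshold keeps only the near-identical pair, while the looser one also pulls in the "Acme Corporation" variants. On real data, the extra pairs found at a lower threshold are where false positives start to appear.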

📌 Practical rule

Start with a stricter threshold. Lower it only when you need broader candidate groups for manual review.
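
One way to apply this rule is to keep a strict threshold for matches and add a second, lower threshold whose only job is to collect candidates for manual review. The sketch below assumes this two-band approach; the triage function and the 0.90 and 0.75 values are illustrative choices, not Datablist settings.

```python
def triage(score: float, strict: float = 0.90, review: float = 0.75) -> str:
    """Classify a similarity score into match, manual review, or distinct."""
    if not 0.0 <= review <= strict <= 1.0:
        raise ValueError("expected 0 <= review <= strict <= 1")
    if score >= strict:
        return "match"      # similar enough to treat as a duplicate
    if score >= review:
        return "review"     # plausible duplicate, needs a human decision
    return "distinct"       # too different to be worth reviewing


for score in (0.95, 0.80, 0.50):
    print(score, triage(score))
```

Lowering only the review threshold widens the candidate groups without loosening what counts as a match.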

Datablist workflow

Datablist supports distance matching in the Duplicates Remover, with Levenshtein and Jaro-Winkler options.

Use distance matching for text columns such as names, company names, addresses, or product titles. Use exact or smart matching for identifiers such as emails, URLs, and domains.

For name matching, compare distance matching with phonetic matching. For broader context, read the article "What is fuzzy matching?"