How similarity is measured

Four algorithms are available, each suited to different use cases:

Levenshtein (edit distance): Counts the minimum number of single-character edits (insertions, deletions, substitutions) needed to turn one string into the other. Best for comparing short strings like names or codes.
Jaccard: Measures how much the token sets overlap relative to their union. Scores 1 when both texts share exactly the same unique words.
Cosine: Compares term-frequency vectors. Considers how often words appear, not just whether they appear. Good for longer texts.
Overlap coefficient: Intersection divided by the size of the smaller set. Returns 1 whenever the smaller text's words all appear in the larger text.

Tokenization

Choose words to split on whitespace (recommended for most text comparisons) or characters to compare at the character level (better for short codes or IDs).

When to use which algorithm

Algorithm	Best for	Not ideal for
Levenshtein distance	Typo detection, short strings, spell-check	Long documents (O(m×n) time)
Jaccard similarity	Set-based duplicate document detection	Order-sensitive comparisons
Cosine similarity	Document/paragraph similarity, TF-IDF comparison	Very short strings (<5 words)
Overlap coefficient	One-directional containment (is A in B?)	Bidirectional similarity measurement

Worked examples

Levenshtein: "kitten" vs "sitting" - the minimum edit distance is 3: substitute k->s, substitute e->i, insert g at end. With 7 characters in the longer string, the normalised similarity is 1 − (3/7) ≈ 0.57.

Jaccard: two sentences - "the cat sat on the mat" (token set: {the, cat, sat, on, mat}) vs "the dog sat on the rug" (token set: {the, dog, sat, on, rug}). Intersection: {the, sat, on} = 3 tokens. Union: 7 tokens. Jaccard = 3/7 ≈ 0.43.

Fuzzy matching applications

Spell-check: find dictionary words within edit distance 1–2 of the mistyped word.
Record linkage: match customer records across databases where names may differ in spelling, middle initials, or hyphenation.
Plagiarism detection: high cosine similarity between paragraphs with different word order suggests paraphrasing.
Customer deduplication: identify duplicate accounts from form submissions where the same person enters their name slightly differently each time.