Toolcroft

Text Similarity Calculator - Compare Two Texts

Measure how similar two texts are using Levenshtein distance, Jaccard similarity, cosine similarity, or the overlap coefficient.

How similarity is measured

Four algorithms are available, each suited to different use cases:

  • Levenshtein (edit distance): Counts the minimum number of single-character edits (insertions, deletions, substitutions) needed to turn one string into the other. Best for comparing short strings like names or codes.
  • Jaccard: Divides the number of unique tokens the two texts share by the number of unique tokens across both texts (intersection over union). Scores 1 when both texts contain exactly the same set of unique words.
  • Cosine: Measures the angle between the two texts' term-frequency vectors, so it accounts for how often each word appears, not just whether it appears. Good for longer texts.
  • Overlap coefficient: Intersection divided by the size of the smaller set. Returns 1 whenever the smaller text's words all appear in the larger text.
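
As a rough illustration of the two measures that weigh containment and frequency, here is a minimal Python sketch of the overlap coefficient and cosine similarity. The function names and the handling of empty input are assumptions made for this example, not a description of how the calculator itself is implemented.

```python
import math
from collections import Counter


def overlap_coefficient(tokens_a, tokens_b):
    """Shared unique tokens divided by the size of the smaller token set."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a or not set_b:
        return 0.0  # assumption: an empty text matches nothing
    return len(set_a & set_b) / min(len(set_a), len(set_b))


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between the two term-frequency vectors."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    # Only tokens present in both texts contribute to the dot product.
    dot = sum(freq_a[t] * freq_b[t] for t in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty text matches nothing
    return dot / (norm_a * norm_b)


shorter = "red apple".split()
longer = "a red apple and a green apple".split()

print(overlap_coefficient(shorter, longer))          # 1.0
print(round(cosine_similarity(shorter, longer), 2))  # 0.64
```

Every word of the shorter text appears in the longer one, so the overlap coefficient is 1, while cosine similarity is lower because the longer text contains additional and repeated words.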

Tokenization

Choose "words" to split the text on whitespace (recommended for most text comparisons) or "characters" to compare the texts one character at a time (better for short codes or IDs).
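
The two tokenization modes can be sketched in a few lines of Python. The `tokenize` function and its `mode` parameter are hypothetical names for this example; whether the calculator also lowercases text or strips punctuation is not stated here, so this sketch does neither.

```python
def tokenize(text, mode="words"):
    """Split text into tokens by whitespace or into single characters."""
    if mode == "words":
        return text.split()   # splits on any run of whitespace
    if mode == "characters":
        return list(text)     # one token per character, spaces included
    raise ValueError(f"unknown mode: {mode!r}")


print(tokenize("AB-1024 rev2"))              # ['AB-1024', 'rev2']
print(tokenize("AB-12", mode="characters"))  # ['A', 'B', '-', '1', '2']
```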

When to use which algorithm

Algorithm | Best for | Not ideal for
Levenshtein distance | Typo detection, short strings, spell-check | Long documents (O(m×n) time)
Jaccard similarity | Set-based duplicate document detection | Order-sensitive comparisons
Cosine similarity | Document/paragraph similarity, TF-IDF comparison | Very short strings (<5 words)
Overlap coefficient | One-directional containment (is A in B?) | Bidirectional similarity measurement

Worked examples

Levenshtein: "kitten" vs "sitting" - the minimum edit distance is 3: substitute k->s, substitute e->i, insert g at end. With 7 characters in the longer string, the normalised similarity is 1 − (3/7) ≈ 0.57.
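
The calculation above can be reproduced with a short dynamic-programming sketch in Python. This is one common way to compute edit distance, keeping only two rows of the table in memory. The function names and the treatment of two empty strings are assumptions for this example.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, substitutions."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(
                prev[j] + 1,         # delete ch_a
                curr[j - 1] + 1,     # insert ch_b
                prev[j - 1] + cost,  # substitute, or keep a matching character
            ))
        prev = curr
    return prev[-1]


def normalised_similarity(a, b):
    """1 minus the edit distance divided by the longer string's length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))                      # 3
print(round(normalised_similarity("kitten", "sitting"), 2))  # 0.57
```

The nested loop visits every pair of characters, which is why the running time grows with the product of the two string lengths.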

Jaccard: "the cat sat on the mat" (token set: {the, cat, sat, on, mat}) vs "the dog sat on the rug" (token set: {the, dog, sat, on, rug}). Intersection: {the, sat, on} = 3 tokens. Union: {the, cat, sat, on, mat, dog, rug} = 7 tokens. Jaccard = 3/7 ≈ 0.43.
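
The same Jaccard figures can be checked with Python's built-in set operations. This is a minimal sketch. The `jaccard` function name and the value returned for two empty texts are assumptions for this example.

```python
def jaccard(tokens_a, tokens_b):
    """Size of the intersection divided by the size of the union."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 1.0  # assumption: two empty texts count as identical
    return len(set_a & set_b) / len(union)


a = "the cat sat on the mat".split()
b = "the dog sat on the rug".split()

print(sorted(set(a) & set(b)))  # ['on', 'sat', 'the']
print(len(set(a) | set(b)))     # 7
print(round(jaccard(a, b), 2))  # 0.43
```

Converting each token list to a set discards the repeated "the", which is why each sentence contributes five tokens, not six.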

Fuzzy matching applications

  • Spell-check: find dictionary words within edit distance 1–2 of the mistyped word.
  • Record linkage: match customer records across databases where names may differ in spelling, middle initials, or hyphenation.
  • Plagiarism detection: cosine similarity ignores word order, so a high score between paragraphs that arrange their words differently suggests the same wording was reordered or lightly paraphrased.
  • Customer deduplication: identify duplicate accounts from form submissions where the same person enters their name slightly differently each time.
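
To make the spell-check case concrete, the following Python sketch returns dictionary words within a chosen edit distance of a mistyped word. The word list, the `suggest` function, and its `max_distance` parameter are hypothetical. A real spell-checker would use a much larger dictionary and usually an index instead of a linear scan.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                    # deletion
                curr[j - 1] + 1,                # insertion
                prev[j - 1] + (ch_a != ch_b),   # substitution or match
            ))
        prev = curr
    return prev[-1]


def suggest(word, dictionary, max_distance=2):
    """Return candidates within max_distance edits, closest first."""
    scored = sorted((edit_distance(word, c), c) for c in dictionary)
    return [candidate for distance, candidate in scored if distance <= max_distance]


# Hypothetical toy dictionary for illustration only.
WORDS = ["kitten", "kite", "mitten", "sitting", "kitchen"]

print(suggest("kiten", WORDS))  # ['kite', 'kitten', 'kitchen', 'mitten']
```

Ties at the same distance are broken alphabetically here. Ranking tied candidates by word frequency would be a natural refinement but is outside this sketch.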