Text Similarity Calculator - Compare Two Texts
Measure the similarity between two texts using Levenshtein distance, Jaccard, cosine, or overlap coefficient algorithms.
How similarity is measured
Four algorithms are available, each suited to different use cases:
- Levenshtein (edit distance): Counts the minimum number of single-character edits (insertions, deletions, substitutions) needed to turn one string into the other. Best for comparing short strings like names or codes.
- Jaccard: Measures how much the token sets overlap relative to their union. Scores 1 when both texts share exactly the same unique words.
- Cosine: Compares term-frequency vectors by the angle between them, so it accounts for how often words appear, not just whether they appear. Good for longer texts.
- Overlap coefficient: Intersection divided by the size of the smaller set. Returns 1 whenever the smaller text's words all appear in the larger text.
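The cosine and overlap coefficient descriptions above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual code: the function names are hypothetical, and returning 0 for an empty input is an assumption.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two term-frequency vectors."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    # Only tokens present in both texts contribute to the dot product.
    dot = sum(freq_a[t] * freq_b[t] for t in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty text is similar to nothing
    return dot / (norm_a * norm_b)


def overlap_coefficient(tokens_a, tokens_b):
    """Size of the intersection divided by the size of the smaller set."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a or not set_b:
        return 0.0  # assumption: an empty text is similar to nothing
    return len(set_a & set_b) / min(len(set_a), len(set_b))
```

For "the cat sat on the mat" against "the dog sat on the rug", the shared tokens contribute 2×2 + 1 + 1 = 6 to the dot product and each vector has squared length 8, so the cosine score is 6/8 = 0.75. The overlap coefficient of "cat sat" against "the cat sat on the mat" is 1, because every word of the shorter text appears in the longer one.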
Tokenization
Choose "words" to split the text on whitespace (recommended for most text comparisons) or "characters" to compare at the character level (better for short codes or IDs).
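The two tokenization modes might look like the following sketch. The function name and the mode strings are hypothetical, and the sketch assumes no lowercasing or punctuation stripping, which the calculator may or may not apply.

```python
def tokenize(text, mode="words"):
    """Split text into tokens for comparison."""
    if mode == "words":
        # str.split() with no argument splits on any run of whitespace.
        return text.split()
    if mode == "characters":
        # Every character, including punctuation, becomes its own token.
        return list(text)
    raise ValueError(f"unknown tokenization mode: {mode!r}")


print(tokenize("the cat  sat"))         # ['the', 'cat', 'sat']
print(tokenize("A1-B2", "characters"))  # ['A', '1', '-', 'B', '2']
```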
When to use which algorithm
| Algorithm | Best for | Not ideal for |
|---|---|---|
| Levenshtein distance | Typo detection, short strings, spell-check | Long documents (O(m×n) time) |
| Jaccard similarity | Set-based duplicate document detection | Order-sensitive comparisons |
| Cosine similarity | Document/paragraph similarity, TF-IDF comparison | Very short strings (<5 words) |
| Overlap coefficient | One-directional containment (is A in B?) | Bidirectional similarity measurement |
Worked examples
Levenshtein: "kitten" vs "sitting" - the minimum edit distance is 3: substitute k->s, substitute e->i, insert g at end. With 7 characters in the longer string, the normalised similarity is 1 − (3/7) ≈ 0.57.
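The Levenshtein figures above can be reproduced with a standard dynamic-programming sketch. The function names are illustrative, and treating two empty strings as fully similar is an assumption.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the row over the shorter string to save memory
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def normalised_similarity(a, b):
    """1 minus the distance divided by the length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))                      # 3
print(round(normalised_similarity("kitten", "sitting"), 2))  # 0.57
```

The nested loops visit every pair of characters, which is the O(m×n) cost that makes this measure a poor fit for long documents.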
Jaccard: "the cat sat on the mat" (token set: {the, cat, sat, on, mat}) vs "the dog sat on the rug" (token set: {the, dog, sat, on, rug}). Intersection: {the, sat, on} = 3 tokens. Union: {the, cat, sat, on, mat, dog, rug} = 7 tokens. Jaccard = 3/7 ≈ 0.43.
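The Jaccard example can be checked with Python's built-in sets. This is a sketch assuming whitespace tokenization, with a hypothetical function name; scoring two empty texts as 1 is also an assumption.

```python
def jaccard(text_a, text_b):
    """Intersection size divided by union size of the two token sets."""
    set_a, set_b = set(text_a.split()), set(text_b.split())
    union = set_a | set_b
    if not union:
        return 1.0  # assumption: two empty texts count as identical
    return len(set_a & set_b) / len(union)


score = jaccard("the cat sat on the mat", "the dog sat on the rug")
print(round(score, 2))  # 0.43
```

Because the texts are reduced to sets, the repeated "the" in each sentence counts only once.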
Fuzzy matching applications
- Spell-check: find dictionary words within edit distance 1–2 of the mistyped word.
- Record linkage: match customer records across databases where names may differ in spelling, middle initials, or hyphenation.
- Plagiarism detection: cosine similarity over term frequencies ignores word order, so a high score between paragraphs whose sentences are arranged differently suggests reordered or lightly reworded copying.
- Customer deduplication: identify duplicate accounts from form submissions where the same person enters their name slightly differently each time.
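For quick experiments with the spell-check and deduplication cases above, Python's standard library offers `difflib.get_close_matches`. Note that it ranks candidates by `SequenceMatcher` ratio (a matching-subsequence score), not by Levenshtein distance, so it stands in for the edit-distance threshold rather than implementing it. The `suggest` wrapper and the word list are hypothetical.

```python
import difflib


def suggest(word, dictionary, max_results=3, cutoff=0.6):
    """Return dictionary entries that resemble `word`, best match first."""
    # cutoff is a similarity ratio between 0 and 1, not an edit distance.
    return difflib.get_close_matches(word, dictionary, n=max_results, cutoff=cutoff)


words = ["ape", "apple", "peach", "puppy"]
print(suggest("appel", words))  # ['apple', 'ape']
```

Raising `cutoff` makes matching stricter, which is roughly analogous to lowering the allowed edit distance.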