Unicode Inspector
Inspect every character in a string. See Unicode code points, hex values, HTML entities, CSS escapes, and character names for each character.
| Char | Code Point | Hex | HTML Entity | CSS Escape | Category |
|---|---|---|---|---|---|
| H | 72 | 0x48 | &#72; | \48 | Latin Uppercase |
| e | 101 | 0x65 | &#101; | \65 | Latin Lowercase |
| l | 108 | 0x6C | &#108; | \6C | Latin Lowercase |
| l | 108 | 0x6C | &#108; | \6C | Latin Lowercase |
| o | 111 | 0x6F | &#111; | \6F | Latin Lowercase |
| · | 32 | 0x20 | &#32; | \20 | Space |
| 😀 | 128512 | 0x1F600 | &#128512; | \1F600 | Emoji |
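The hex value, HTML entity, and CSS escape columns can all be derived from a character's code point. The following is a minimal sketch of that derivation; the `inspect` function name and the exact output formats (`0x` prefix, decimal entity) are assumptions for illustration, not necessarily what the inspector itself uses.

```javascript
// Hypothetical helper: describe every code point in a string.
function inspect(text) {
  // Array.from iterates by code point, so surrogate pairs stay together.
  return Array.from(text, (ch) => {
    const cp = ch.codePointAt(0);
    const hex = cp.toString(16).toUpperCase();
    return {
      char: ch,
      codePoint: cp,                // decimal code point
      hex: "0x" + hex,              // hexadecimal value
      htmlEntity: "&#" + cp + ";",  // decimal numeric character reference
      cssEscape: "\\" + hex,        // may need a trailing space before a hex digit
    };
  });
}

console.log(inspect("Hi \u{1F600}"));
```

Iterating by code point rather than by index matters here: indexing the string directly would split 😀 into two surrogate halves.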
What is Unicode?
Unicode is a universal character encoding standard that assigns a unique numeric code point to every character in every writing system. Unicode 15.0 (2022) defines over 149,000 characters across 161 scripts, including Latin, Arabic, Chinese, emoji, musical notation, and ancient scripts.
Code points and encodings
- Code point: the number assigned to a character (e.g., U+0041 = 'A', U+1F600 = 😀).
- UTF-8: variable-width encoding; ASCII characters use 1 byte, common scripts 2–3 bytes, emoji 4 bytes. Most common on the web.
- UTF-16: uses 2 or 4 bytes per code point (one or two 16-bit code units). Used internally by JavaScript, Java, and Windows APIs.
- UTF-32: fixed 4 bytes per character. Simple but memory-intensive.
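To make the size differences concrete, here is a rough sketch that reports how many bytes a string would occupy in each encoding. The `encodedSizes` name is hypothetical, and the sketch assumes well-formed input (no lone surrogates) and a runtime that provides the standard `TextEncoder`.

```javascript
// Hypothetical helper: byte size of a string in each encoding form.
function encodedSizes(text) {
  const codePoints = Array.from(text).length; // iterate by code point
  return {
    utf8: new TextEncoder().encode(text).length, // TextEncoder always emits UTF-8
    utf16: text.length * 2,                      // .length counts 16-bit code units
    utf32: codePoints * 4,                       // fixed 4 bytes per code point
  };
}

console.log(encodedSizes("A"));         // { utf8: 1, utf16: 2, utf32: 4 }
console.log(encodedSizes("\u4E2D"));    // 中: { utf8: 3, utf16: 2, utf32: 4 }
console.log(encodedSizes("\u{1F600}")); // 😀: { utf8: 4, utf16: 4, utf32: 4 }
```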
Grapheme clusters
What appears as a single "character" on screen may be multiple Unicode code points. For
example, the flag emoji 🇺🇸 is two regional indicator symbols (U+1F1FA U+1F1F8). A woman with
red hair 👩‍🦰 is three code points: WOMAN (U+1F469), a zero-width joiner (U+200D), and the
red hair component (U+1F9B0). Because both emoji code points need a surrogate pair,
"👩‍🦰".length in JavaScript returns 5, not 1.
Unicode confusables and security
Visually identical characters from different scripts can be exploited in IDN homograph attacks: for example, the Cyrillic small letter ‘a’ (U+0430) looks identical to the Latin ‘a’ (U+0061) in most fonts, but they are entirely different characters. Attackers register domains like “apрle.com” (with a Cyrillic р) to spoof trusted brands. The inspector helps identify such substitutions in URLs, source code, or any text where character identity matters.
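A simple first check for this kind of substitution is to flag labels that mix scripts. The sketch below uses Unicode property escapes in regular expressions; the function names and the short list of scripts are assumptions chosen for illustration. It does not catch spoofs written entirely in one look-alike script, so treat it as a heuristic rather than a complete defense.

```javascript
// Hypothetical helper: report which of a few scripts appear in a label.
const SCRIPT_PATTERNS = {
  Latin: /\p{Script=Latin}/u,
  Cyrillic: /\p{Script=Cyrillic}/u,
  Greek: /\p{Script=Greek}/u,
};

function scriptsUsed(label) {
  return Object.keys(SCRIPT_PATTERNS).filter((name) =>
    SCRIPT_PATTERNS[name].test(label)
  );
}

// A label that mixes two or more of the listed scripts is suspicious.
function looksMixedScript(label) {
  return scriptsUsed(label).length > 1;
}

console.log(scriptsUsed("ap\u0440le"));      // [ 'Latin', 'Cyrillic' ] (U+0440 is Cyrillic р)
console.log(looksMixedScript("ap\u0440le")); // true
console.log(looksMixedScript("apple"));      // false
```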
Emoji composition example
The rainbow flag emoji 🏳️‍🌈 is composed of four code points:
- U+1F3F3 - WAVING WHITE FLAG 🏳
- U+FE0F - VARIATION SELECTOR-16 (forces emoji presentation)
- U+200D - ZERO WIDTH JOINER (ZWJ, glues adjacent emoji into a single glyph)
- U+1F308 - RAINBOW 🌈
ZWJ sequences are how most “family” and “profession” emoji are formed. The result is rendered as a single glyph by emoji-aware renderers, but tools that lack ZWJ support display the individual emoji side by side.
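The decomposition above can be reproduced in a few lines. This is a small sketch; `toCodePoints` and `zwjParts` are hypothetical helper names, and the emoji is written with escapes so that the invisible code points are not lost when the code is copied.

```javascript
// Rainbow flag: WAVING WHITE FLAG + VARIATION SELECTOR-16 + ZWJ + RAINBOW.
const rainbowFlag = "\u{1F3F3}\uFE0F\u200D\u{1F308}";

// Hypothetical helper: list code points in U+XXXX notation.
function toCodePoints(text) {
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

// Hypothetical helper: split a ZWJ sequence into the emoji it joins.
function zwjParts(text) {
  return text.split("\u200D");
}

console.log(toCodePoints(rainbowFlag)); // [ 'U+1F3F3', 'U+FE0F', 'U+200D', 'U+1F308' ]
console.log(zwjParts(rainbowFlag).length); // 2 (the flag with its selector, and the rainbow)
```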
UTF-8 byte counts
| Character category | Byte count | Examples |
|---|---|---|
| ASCII (U+0000–U+007F) | 1 byte | A, 0, space, @ |
| Latin extended, Greek, Hebrew, Arabic (U+0080–U+07FF) | 2 bytes | é, α, א, ع |
| CJK ideographs, most symbols (U+0800–U+FFFF) | 3 bytes | 中, 日, €, ♥ |
| Most emoji, rare and historic scripts (U+10000–U+10FFFF) | 4 bytes | 😀, 👍, 🌍 |
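The ranges in the table translate directly into a lookup on the code point. The sketch below uses a hypothetical `utf8ByteLength` function and assumes a valid Unicode scalar value (surrogate code points U+D800–U+DFFF cannot be encoded in UTF-8 on their own).

```javascript
// Hypothetical helper: UTF-8 byte count for a single code point.
function utf8ByteLength(codePoint) {
  if (codePoint <= 0x7f) return 1;   // ASCII
  if (codePoint <= 0x7ff) return 2;  // Latin extended, Greek, Hebrew, Arabic
  if (codePoint <= 0xffff) return 3; // rest of the Basic Multilingual Plane
  return 4;                          // supplementary planes
}

for (const ch of ["A", "\u00E9", "\u20AC", "\u{1F600}"]) {
  // Cross-check the range lookup against the platform's UTF-8 encoder.
  const encoded = new TextEncoder().encode(ch).length;
  console.log(ch, utf8ByteLength(ch.codePointAt(0)), encoded);
}
```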
JavaScript’s string.length counts UTF-16 code units, not bytes or grapheme clusters.
A character in the 4-byte UTF-8 range is stored as a UTF-16 surrogate pair and contributes 2 to
.length, which is why "😀".length === 2 in JavaScript.
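The surrogate pair itself follows from simple arithmetic on the code point: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. The sketch below illustrates this with a hypothetical `toSurrogatePair` function, which assumes its input lies in U+10000–U+10FFFF.

```javascript
// Hypothetical helper: UTF-16 surrogate pair for a supplementary code point.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;      // 20-bit value
  const high = 0xd800 + (offset >> 10);    // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);   // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1f600);
console.log(high.toString(16), low.toString(16));           // d83d de00
console.log(String.fromCharCode(high, low) === "\u{1F600}"); // true
console.log("\u{1F600}".length);                             // 2
```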