What is Unicode?

Unicode is a universal character encoding standard that assigns a unique numeric code point to each character it covers, with the goal of supporting every writing system. Unicode 15.0 (2022) defines over 149,000 characters across 161 scripts, including Latin, Arabic, and Chinese, along with emoji, musical notation, and ancient scripts.

Code points and encodings

Code point: the number assigned to a character (e.g., U+0041 = 'A', U+1F600 = 😀).
UTF-8: variable-width encoding; ASCII characters use 1 byte, common scripts 2–3 bytes, emoji 4 bytes. Most common on the web.
UTF-16: variable-width encoding; 2 or 4 bytes per code point (one or two 16-bit code units). Used internally by JavaScript, Java, and Windows APIs.
UTF-32: fixed 4 bytes per code point. Simple but memory-intensive.
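
The size differences between the three encodings can be checked directly. The following is a minimal sketch, assuming a runtime that provides the standard TextEncoder (modern browsers and Node.js); the function name encodedSizes is hypothetical and used only for illustration.

```javascript
// Report how many bytes one string occupies in each encoding form.
// encodedSizes is a hypothetical helper name, not part of any library.
function encodedSizes(text) {
  // Iterating a string yields whole code points, not UTF-16 code units.
  const codePoints = Array.from(text).length;
  return {
    codePoints,
    utf8Bytes: new TextEncoder().encode(text).length, // TextEncoder always emits UTF-8
    utf16Bytes: text.length * 2, // .length counts 16-bit code units
    utf32Bytes: codePoints * 4, // fixed 4 bytes per code point
  };
}

// Converting between a code point and its character.
const letterA = String.fromCodePoint(0x41); // "A"
const grinning = String.fromCodePoint(0x1f600); // U+1F600

console.log(letterA.codePointAt(0).toString(16)); // "41"
console.log(encodedSizes(letterA));
// { codePoints: 1, utf8Bytes: 1, utf16Bytes: 2, utf32Bytes: 4 }
console.log(encodedSizes(grinning));
// { codePoints: 1, utf8Bytes: 4, utf16Bytes: 4, utf32Bytes: 4 }
```

For ASCII-heavy text UTF-8 is the most compact of the three, which is consistent with its popularity on the web.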

Grapheme clusters

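What appears as a single "character" on screen may be several Unicode code points; such a user-perceived character is called a grapheme cluster. For example, the flag emoji 🇺🇸 is two regional indicator symbols (U+1F1FA U+1F1F8). A woman with red hair 👩‍🦰 is three code points: WOMAN (U+1F469), a zero-width joiner (U+200D), and the red hair component (U+1F9B0). This is why "👩‍🦰".length in JavaScript returns 5, not 1: each of the two emoji takes two UTF-16 code units and the joiner takes one.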
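
To see the gap between code units, code points, and grapheme clusters, one option is the standard Intl.Segmenter API. This is a rough sketch that assumes a runtime with Intl.Segmenter support (recent browsers, Node.js 16 or later); the helper name measure is hypothetical.

```javascript
// Count the three different "lengths" of a string.
// measure is a hypothetical helper name used for illustration.
function measure(text) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  return {
    codeUnits: text.length, // UTF-16 code units
    codePoints: Array.from(text).length, // Unicode code points
    graphemes: Array.from(segmenter.segment(text)).length, // user-perceived characters
  };
}

// Escape sequences are used so the code points are visible in the source.
const usFlag = "\u{1F1FA}\u{1F1F8}"; // regional indicators U + S
const redHairWoman = "\u{1F469}\u{200D}\u{1F9B0}"; // woman + ZWJ + red hair

console.log(measure(usFlag)); // { codeUnits: 4, codePoints: 2, graphemes: 1 }
console.log(measure(redHairWoman)); // { codeUnits: 5, codePoints: 3, graphemes: 1 }
```

Operations such as truncation or cursor movement generally want the grapheme count, since cutting a string in the middle of a cluster can split an emoji into its parts.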

Unicode confusables and security

Visually identical characters from different scripts can be exploited in IDN homograph attacks. For example, the Cyrillic small letter ‘а’ (U+0430) looks identical to the Latin ‘a’ (U+0061) in most fonts, but they are entirely different characters. Attackers register domains like “apрle.com” (with a Cyrillic р, U+0440, in place of the Latin p) to spoof trusted brands. The inspector helps identify such substitutions in URLs, source code, or any text where character identity matters.
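
A simple heuristic for spotting this kind of substitution is to flag domain labels that mix scripts. The sketch below is illustrative only and is not how the inspector itself works: it checks just three scripts using Unicode property escapes in regular expressions, and the names SCRIPT_TESTS, scriptsUsed, and hasMixedScriptLabel are hypothetical. A production check would follow the Unicode security recommendations instead.

```javascript
// Scripts to look for. Assumption: Latin, Cyrillic, and Greek are enough for a demo.
const SCRIPT_TESTS = {
  Latin: /\p{Script=Latin}/u,
  Cyrillic: /\p{Script=Cyrillic}/u,
  Greek: /\p{Script=Greek}/u,
};

// Return the names of the listed scripts that occur in the text.
function scriptsUsed(text) {
  const found = new Set();
  for (const char of text) {
    for (const [name, pattern] of Object.entries(SCRIPT_TESTS)) {
      if (pattern.test(char)) found.add(name);
    }
  }
  return [...found];
}

// Hypothetical helper: true if any dot-separated label mixes two or more scripts.
function hasMixedScriptLabel(domain) {
  return domain.split(".").some((label) => scriptsUsed(label).length > 1);
}

const spoofed = "ap\u0440le.com"; // U+0440 is CYRILLIC SMALL LETTER ER
console.log(scriptsUsed(spoofed)); // [ 'Latin', 'Cyrillic' ]
console.log(hasMixedScriptLabel(spoofed)); // true
console.log(hasMixedScriptLabel("apple.com")); // false
```

Note that this heuristic does not catch a label written entirely in one look-alike script, so it should be treated as a first filter rather than a complete defense.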

Emoji composition example

The rainbow flag emoji 🏳️‍🌈 is composed of four code points:

U+1F3F3 - WAVING WHITE FLAG 🏳
U+FE0F - VARIATION SELECTOR-16 (forces emoji presentation)
U+200D - ZERO WIDTH JOINER (ZWJ, glues adjacent emoji into a single glyph)
U+1F308 - RAINBOW 🌈

ZWJ sequences are how most “family” and “profession” emoji are formed. The result is rendered as a single glyph by emoji-aware renderers, but tools that lack ZWJ support display the individual emoji side by side.
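
The breakdown above can be reproduced with a few lines of JavaScript. This is a small sketch; the helper names formatCodePoint and listCodePoints are hypothetical.

```javascript
// Format one code point as U+XXXX (at least four hex digits).
function formatCodePoint(codePoint) {
  return "U+" + codePoint.toString(16).toUpperCase().padStart(4, "0");
}

// Split a string into its code points (Array.from iterates by code point).
function listCodePoints(text) {
  return Array.from(text, (char) => formatCodePoint(char.codePointAt(0)));
}

// The rainbow flag, written with escapes so each code point is visible.
const rainbowFlag = "\u{1F3F3}\u{FE0F}\u{200D}\u{1F308}";
console.log(listCodePoints(rainbowFlag));
// [ 'U+1F3F3', 'U+FE0F', 'U+200D', 'U+1F308' ]

// Splitting on the ZWJ yields the parts that a renderer without
// ZWJ support would draw side by side.
console.log(rainbowFlag.split("\u200D").length); // 2
```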

UTF-8 byte counts

Character category	Byte count	Examples
ASCII (U+0000–U+007F)	1 byte	A, 0, space, @
Latin extended, Greek, Hebrew, Arabic (U+0080–U+07FF)	2 bytes	é, α, א, ع
CJK ideographs, most symbols (U+0800–U+FFFF)	3 bytes	中, 日, €, ♥
Emoji, rare scripts (U+10000–U+10FFFF)	4 bytes	😀, 👍, 🌍
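
The byte counts in the table follow from how UTF-8 packs a code point's bits into lead and continuation bytes. The following hand-written encoder is a sketch for illustration (the function name utf8Encode is hypothetical); in practice the built-in TextEncoder does this work.

```javascript
// Encode a single code point to UTF-8 bytes by hand.
// utf8Encode is a hypothetical name; real code should use TextEncoder.
function utf8Encode(codePoint) {
  const isSurrogate = codePoint >= 0xd800 && codePoint <= 0xdfff;
  if (codePoint < 0 || codePoint > 0x10ffff || isSurrogate) {
    throw new RangeError("not a Unicode scalar value");
  }
  if (codePoint <= 0x7f) {
    return [codePoint]; // 0xxxxxxx
  }
  if (codePoint <= 0x7ff) {
    // 110xxxxx 10xxxxxx
    return [0xc0 | (codePoint >> 6), 0x80 | (codePoint & 0x3f)];
  }
  if (codePoint <= 0xffff) {
    // 1110xxxx 10xxxxxx 10xxxxxx
    return [
      0xe0 | (codePoint >> 12),
      0x80 | ((codePoint >> 6) & 0x3f),
      0x80 | (codePoint & 0x3f),
    ];
  }
  // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  return [
    0xf0 | (codePoint >> 18),
    0x80 | ((codePoint >> 12) & 0x3f),
    0x80 | ((codePoint >> 6) & 0x3f),
    0x80 | (codePoint & 0x3f),
  ];
}

const toHex = (bytes) => bytes.map((b) => b.toString(16)).join(" ");
console.log(toHex(utf8Encode(0x41))); // "41"
console.log(toHex(utf8Encode(0xe9))); // "c3 a9"
console.log(toHex(utf8Encode(0x20ac))); // "e2 82 ac"
console.log(toHex(utf8Encode(0x1f600))); // "f0 9f 98 80"
```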

JavaScript’s string.length counts UTF-16 code units, not bytes or grapheme clusters. A character in the 4-byte UTF-8 range is stored as a UTF-16 surrogate pair and contributes 2 to .length, which is why "😀".length === 2 in JavaScript.
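
The surrogate pair for a code point above U+FFFF can be derived with simple arithmetic: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. A brief sketch follows; the function name toSurrogatePair is hypothetical.

```javascript
// Compute the UTF-16 surrogate pair for a supplementary-plane code point.
// toSurrogatePair is a hypothetical helper name.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("code point does not need a surrogate pair");
  }
  const offset = codePoint - 0x10000; // 20-bit value
  const high = 0xd800 + (offset >> 10); // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1f600);
console.log(high.toString(16), low.toString(16)); // d83d de00

// The pair matches what JavaScript stores for the same character.
const grinningFace = "\u{1F600}";
console.log(grinningFace.length); // 2
console.log(grinningFace.charCodeAt(0) === high); // true
console.log(grinningFace.charCodeAt(1) === low); // true
```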

Char	Code Point	Category
H	72	Latin Uppercase
e	101	Latin Lowercase
l	108	Latin Lowercase
l	108	Latin Lowercase
o	111	Latin Lowercase
·	32	Space
😀	128512	Emoji