WTF is Unicode?
Unicode standardizes characters so that every computer interprets them the same way. Before Unicode, text and email sent between systems with different encodings often arrived as garbled, unreadable text. Unicode is why, when someone texts you 🙂, you see an emoji instead of `���???`.
Why Unicode exists
Computers started with ASCII: 128 characters, English-shaped, good enough until it was not. Every country built its own encodings. The same byte might be a curly quote in Windows-1252 and an invisible control code in ISO-8859-1. Sharing files across borders was a lottery.
Unicode is one catalog of every character anyone asked to standardize. Each entry gets a number called a code point. The letter A is U+0041. The smiley 😀 is U+1F600. The shrug kaomoji ¯\_(ツ)_/¯ is a sequence of several code points, not one.
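To make code points concrete, here is a minimal sketch that lists the code points of a string in U+XXXX notation. The helper name listCodePoints is an assumption for illustration, not something from this site's code.

```javascript
// Hypothetical helper: list each code point of a string in U+XXXX notation.
function listCodePoints(text) {
  // Spreading a string walks it by code point, not by UTF-16 unit.
  return [...text].map(
    (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

const shrug = "¯\\_(ツ)_/¯"; // the backslash must be escaped in source code
const shrugPoints = listCodePoints(shrug);

console.log(shrugPoints.length); // 9 code points
console.log(shrugPoints[4]); // "U+30C4" (KATAKANA LETTER TU)
console.log(listCodePoints("\u{1F600}")); // ["U+1F600"]
console.log(String.fromCodePoint(0x41)); // "A"
```

The kaomoji is nine ordinary code points in a row; the emoji is a single one.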
The Unicode Consortium publishes the list. UTF-8, UTF-16, and UTF-32 are different ways to store those numbers in bytes. Confusing “Unicode” with “UTF-8” is the most common mistake in text handling. Unicode is the phone book; UTF-8 is one way to write the numbers down.
Unicode does not store characters. It assigns meaning to integers. Everything else is encoding and font rendering.
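The phone book analogy can be sketched in code: one code point, three ways of writing it down. The helper encodingForms is a hypothetical name used only for this illustration.

```javascript
// Hypothetical helper: show one code point in each Unicode encoding form.
function encodingForms(codePoint) {
  const ch = String.fromCodePoint(codePoint);
  const hex = (n, width) => n.toString(16).toUpperCase().padStart(width, "0");
  return {
    // UTF-8: one to four 8-bit bytes.
    utf8: [...new TextEncoder().encode(ch)].map((b) => hex(b, 2)),
    // UTF-16: one or two 16-bit units (JS strings already use this form).
    utf16: Array.from({ length: ch.length }, (_, i) => hex(ch.charCodeAt(i), 4)),
    // UTF-32: always one 32-bit unit, the code point itself.
    utf32: [hex(codePoint, 8)],
  };
}

console.log(encodingForms(0x41));
// { utf8: ["41"], utf16: ["0041"], utf32: ["00000041"] }
console.log(encodingForms(0x1f600));
// { utf8: ["F0","9F","98","80"], utf16: ["D83D","DE00"], utf32: ["0001F600"] }
```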
Code points vs characters
What you see as one “character” on screen is often several code points. A user thinks é is one key. Unicode can represent it as:
- One code point: U+00E9 (LATIN SMALL LETTER E WITH ACUTE)
- Two code points: U+0065 U+0301 (e + combining acute accent)
They look identical. They are not equal in a naive string compare. That distinction will break your search index if you ignore it.
Grapheme clusters
What humans perceive as one character is a grapheme cluster. Emoji make this obvious. 👨‍👩‍👧 is three people joined with zero-width joiners (U+200D): U+1F468 U+200D U+1F469 U+200D U+1F467. 😵‍💫 is dizzy face (U+1F635) + ZWJ + dizzy symbol (U+1F4AB). Your programming language’s .length does not count graphemes unless you ask it to.
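A small sketch of the three ways to count the same string. The family emoji is written with explicit escapes so the invisible joiners cannot get lost in copy and paste. It assumes a runtime with Intl.Segmenter.

```javascript
// Man + ZWJ + woman + ZWJ + girl, spelled out as escapes.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";

// UTF-16 code units: what .length reports.
const unitCount = family.length;

// Code points: what string iteration yields.
const codePointCount = [...family].length;

// Grapheme clusters: what a person would call "characters".
const graphemeCount = [
  ...new Intl.Segmenter("en", { granularity: "grapheme" }).segment(family),
].length;

console.log(unitCount, codePointCount, graphemeCount); // 8 5 1
```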
Surrogate pairs in UTF-16
JavaScript strings are UTF-16. Code points above U+FFFF are stored as two 16-bit units called a surrogate pair. So "😀".length === 2 in JS even though you see one emoji. Use the spread operator or Array.from to count code points:
const grin = "😀";
grin.length; // 2 (UTF-16 code units)
[...grin].length; // 1 (code points)
grin.codePointAt(0); // 128512 (0x1F600)
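The two units are not arbitrary. A sketch of the arithmetic behind a surrogate pair follows; toSurrogatePair is a hypothetical helper name, and in real code String.fromCodePoint already does this for you.

```javascript
// Hypothetical helper: split a code point above U+FFFF into its surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("only U+10000..U+10FFFF are stored as surrogate pairs");
  }
  const offset = codePoint - 0x10000; // leaves a 20-bit value
  const high = 0xd800 + (offset >> 10); // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
}

const [hi, lo] = toSurrogatePair(0x1f600);
console.log(hi.toString(16), lo.toString(16)); // "d83d" "de00"
console.log(String.fromCharCode(hi, lo) === "\u{1F600}"); // true
```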
UTF-8 and bytes on the wire
UTF-8 won the internet because it is backward-compatible with ASCII and compact for English. Code points under 128 use one byte. Higher code points use two to four bytes. JSON APIs, HTML documents, and git repos are overwhelmingly UTF-8 today.
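The byte counts are easy to check. This sketch compares what TextEncoder actually produces with the length predicted from the code point range; both helper names are assumptions for illustration.

```javascript
const utf8Encoder = new TextEncoder();

// Actual number of bytes the string occupies in UTF-8.
function utf8ByteLength(text) {
  return utf8Encoder.encode(text).length;
}

// Bytes UTF-8 needs for a single code point, by range.
function utf8BytesForCodePoint(codePoint) {
  if (codePoint < 0x80) return 1; // ASCII
  if (codePoint < 0x800) return 2;
  if (codePoint < 0x10000) return 3;
  return 4; // everything above U+FFFF
}

for (const ch of ["A", "\u00E9", "\u30C4", "\u{1F600}"]) {
  console.log(ch, utf8ByteLength(ch), utf8BytesForCodePoint(ch.codePointAt(0)));
}
// A 1 1, then é 2 2, then ツ 3 3, then 😀 4 4
```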
When you set <meta charset="utf-8"> in HTML, you tell the browser how to turn bytes into code points. If the declaration lies, you get mojibake: Ã© instead of é.
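Mojibake is easy to reproduce on purpose: encode as UTF-8, then decode as something else. This sketch assumes a runtime whose TextDecoder supports the "windows-1252" label (browsers do; Node does when built with full ICU).

```javascript
// é as a single code point, encoded to UTF-8: bytes C3 A9.
const utf8Bytes = new TextEncoder().encode("\u00E9");

// Decoding with the wrong label turns each byte into its own character.
const misread = new TextDecoder("windows-1252").decode(utf8Bytes);

// Decoding with the right label round-trips.
const correct = new TextDecoder("utf-8").decode(utf8Bytes);

console.log(misread); // "Ã©"
console.log(correct); // "é"
```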
UTF-16 still shows up inside browsers (DOM strings) and Windows internals. UTF-32 is mostly for tooling. As a web developer, you live in UTF-8 on the network and UTF-16 in the JS runtime. Know both.
BOM
UTF-8 can start with a byte order mark (U+FEFF). Some editors add it; some parsers choke on it. If your string equality tests fail on “identical” files, check for an invisible BOM at the start.
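A minimal sketch of spotting and removing a leading BOM. stripBom is a hypothetical helper name; note that TextDecoder already drops the BOM unless you pass ignoreBOM: true.

```javascript
// Hypothetical helper: drop a leading U+FEFF if present.
function stripBom(text) {
  return text.charCodeAt(0) === 0xfeff ? text.slice(1) : text;
}

const withBom = "\uFEFFhello";
console.log(withBom === "hello"); // false, although both print the same
console.log(stripBom(withBom) === "hello"); // true

// In UTF-8 the BOM is the three bytes EF BB BF.
const bomBytes = new Uint8Array([0xef, 0xbb, 0xbf, 0x68, 0x69]);
const decodedDefault = new TextDecoder("utf-8").decode(bomBytes);
const decodedKeepBom = new TextDecoder("utf-8", { ignoreBOM: true }).decode(bomBytes);
console.log(decodedDefault.length); // 2 ("hi", BOM removed)
console.log(decodedKeepBom.length); // 3 (BOM kept as U+FEFF)
```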
Normalization
Unicode defines normalization forms so equivalent sequences can be compared reliably:
- NFC: composed (common on the web, Windows, and Linux)
- NFD: decomposed (common on macOS for filenames)
const composed = "\u00E9"; // é as one code point
const decomposed = "e\u0301"; // e + combining accent
composed === decomposed; // false
composed.normalize("NFC") === decomposed.normalize("NFC"); // true
Normalize before you index, search, or hash user text. Passwords and identifiers are especially sensitive: two usernames that look the same might be different byte sequences.
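One way to apply this is to normalize once, at the boundary, and use the result as the lookup key. The sketch below uses hypothetical names (canonicalKey, registerName) and only NFC; a real identifier policy may need more than normalization.

```javascript
// Hypothetical helper: the form every stored or compared name goes through.
function canonicalKey(text) {
  return text.normalize("NFC");
}

const takenNames = new Set();

// Returns true if the name was free, false if an equivalent one exists.
function registerName(name) {
  const key = canonicalKey(name);
  if (takenNames.has(key)) return false;
  takenNames.add(key);
  return true;
}

const firstAttempt = registerName("Jos\u00E9"); // composed é
const secondAttempt = registerName("Jose\u0301"); // e + combining accent
console.log(firstAttempt, secondAttempt); // true false
```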
How browsers interpret text
The browser pipeline, simplified:
- Bytes arrive (HTTP, file, or clipboard).
- Encoding converts bytes → code points (UTF-8 for modern pages).
- The DOM stores text mostly as UTF-16 in JavaScript bindings.
- CSS picks a font and maps code points to glyphs.
If the right glyph does not exist in any font in your stack, you see tofu (□) or a fallback from another font. Emoji need color fonts (Apple Color Emoji, Segoe UI Emoji, Noto Color Emoji). Kaomoji are usually plain text in a monospace font—no special emoji font required, but missing symbols still fail.
TextEncoder and TextDecoder
The Web Platform exposes explicit UTF-8 conversion:
const bytes = new TextEncoder().encode("😀"); // Uint8Array, 4 bytes
const back = new TextDecoder("utf-8").decode(bytes); // "😀"
Use these when you talk to binary APIs (WASM, crypto, files) instead of guessing encodings.
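By default TextDecoder replaces malformed bytes with U+FFFD and carries on. When you would rather know, the fatal option makes it throw. A sketch, with decodeUtf8Strict as a hypothetical helper name:

```javascript
// Hypothetical helper: return the decoded string, or null if the bytes
// are not valid UTF-8.
function decodeUtf8Strict(bytes) {
  try {
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch (err) {
    return null; // the decoder throws a TypeError on malformed input
  }
}

const wholeEmoji = new Uint8Array([0xf0, 0x9f, 0x98, 0x80]);
const truncated = new Uint8Array([0xf0, 0x9f]); // cut off mid-sequence

const strictWhole = decodeUtf8Strict(wholeEmoji); // the emoji U+1F600
const strictTruncated = decodeUtf8Strict(truncated); // null
const lenientTruncated = new TextDecoder().decode(truncated); // contains U+FFFD

console.log(strictWhole, strictTruncated, lenientTruncated.includes("\uFFFD"));
```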
Fetch and charset
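For fetch(), the response body is bytes. Calling .text() or .json() always decodes those bytes as UTF-8, whatever charset the Content-Type header declares. If the server actually sent another encoding, international text is corrupted silently; read the bytes with .arrayBuffer() and decode them with a TextDecoder for the right encoding.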
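When a body arrives in a legacy encoding, one option is to take the raw bytes and decode them with an explicit label. A minimal sketch, assuming hypothetical helpers charsetFromContentType and decodeBody and a runtime whose TextDecoder supports legacy labels:

```javascript
// Hypothetical helper: pull the charset parameter out of a Content-Type value.
function charsetFromContentType(contentType, fallback = "utf-8") {
  const match = /charset\s*=\s*"?([^";\s]+)"?/i.exec(contentType || "");
  return match ? match[1].toLowerCase() : fallback;
}

// Hypothetical helper: decode bytes using the declared charset.
function decodeBody(bytes, contentType) {
  return new TextDecoder(charsetFromContentType(contentType)).decode(bytes);
}

// "café" as ISO-8859-1 bytes: the é is the single byte E9.
const latin1Body = new Uint8Array([0x63, 0x61, 0x66, 0xe9]);

const decodedByHeader = decodeBody(latin1Body, "text/plain; charset=ISO-8859-1");
const decodedAsUtf8 = new TextDecoder("utf-8").decode(latin1Body);

console.log(decodedByHeader); // "café"
console.log(decodedAsUtf8.endsWith("\uFFFD")); // true, the é was lost
```

In a page you would get the inputs from the response itself, for example new Uint8Array(await response.arrayBuffer()) and response.headers.get("content-type").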
JavaScript string traps
Three bugs appear in production code more than anything else:
- Using .length for “character count.” It counts UTF-16 units.
- Regex without u. /^.$/ can break on astral symbols; /^.$/u matches one code point.
- Slicing by index. str.slice(0, 1) can split a surrogate pair and produce invalid strings.
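All three traps can be reproduced with a single emoji. A short sketch (variable names are illustrative):

```javascript
const face = "\u{1F600}"; // one emoji, above U+FFFF

// Trap 1: .length counts UTF-16 units.
const unitLength = face.length; // 2

// Trap 2: without the u flag, "." matches one UTF-16 unit.
const matchesWithoutU = /^.$/.test(face); // false
const matchesWithU = /^.$/u.test(face); // true

// Trap 3: slicing by index can cut a surrogate pair in half.
const brokenSlice = face.slice(0, 1); // lone high surrogate
const safeFirst = [...face][0]; // the whole code point

console.log(unitLength, matchesWithoutU, matchesWithU);
console.log(brokenSlice === "\uD83D", safeFirst === face); // true true
```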
Prefer String.prototype.codePointAt, String.fromCodePoint, and Intl.Segmenter (where available) for user-visible character boundaries:
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
[...segmenter.segment("👨\u200D👩\u200D👧")].length; // 1 grapheme (ZWJs written as escapes)
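The same segmenter gives a safe replacement for index-based truncation. A sketch, assuming Intl.Segmenter is available; truncateGraphemes is a hypothetical helper name.

```javascript
// Hypothetical helper: keep at most maxGraphemes user-visible characters.
function truncateGraphemes(text, maxGraphemes, locale = "en") {
  const graphemeSegmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  let out = "";
  let count = 0;
  for (const { segment } of graphemeSegmenter.segment(text)) {
    if (count === maxGraphemes) break;
    out += segment; // append whole clusters only
    count += 1;
  }
  return out;
}

// e + combining accent, then a ZWJ family emoji, then "!".
const sampleText = "e\u0301\u{1F468}\u200D\u{1F469}\u200D\u{1F467}!";
const firstTwo = truncateGraphemes(sampleText, 2);

console.log(firstTwo.length); // 10 UTF-16 units, but two visible characters
console.log(sampleText.slice(0, 2) === "e\u0301"); // index slicing stops after the é
```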
For clipboard writes, modern browsers support navigator.clipboard.writeText(). It takes a DOM string (UTF-16) and hands it to the system clipboard, which stores it in whatever encoding the platform uses. The grid on the home page uses a hidden textarea and document.execCommand('copy') for broader compatibility; the text you get is still the same code points.
Emoji, kaomoji, and your clipboard
This site has two kinds of entries in one grid:
- Kaomoji: ASCII and Unicode symbols arranged as art (¯\_(ツ)_/¯). Long lines; monospace helps alignment.
- Emoji smileys: standardized colorful characters in the Unicode emoji blocks (😀).
When you click a card, you copy the exact code point sequence stored in data—not a PNG, not a shortcut. Paste into an app that supports that script and font, and it works. Paste into a terminal that only loves ASCII, and you get noise.
Zero-width joiners and variation selectors
Sequences like ☺️ often include U+FE0F (variation selector-16) to request the emoji presentation. ☺ without it may render as text. Your byte count changes; your eyes might not notice.
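The difference is invisible on screen but measurable in code. A sketch with escapes so the selector is explicit; stripVariationSelectors is a hypothetical helper, useful for building comparison keys but not for display, since it discards the presentation request.

```javascript
const textStyle = "\u263A"; // smiling face, no selector
const emojiStyle = "\u263A\uFE0F"; // same face + variation selector-16

const utf8 = new TextEncoder();
console.log(textStyle.length, emojiStyle.length); // 1 2
console.log(utf8.encode(textStyle).length, utf8.encode(emojiStyle).length); // 3 6
console.log(textStyle === emojiStyle); // false

// Hypothetical helper: drop U+FE0E (text) and U+FE0F (emoji) selectors.
function stripVariationSelectors(text) {
  return text.replace(/[\uFE0E\uFE0F]/g, "");
}

console.log(stripVariationSelectors(emojiStyle) === textStyle); // true
```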
Bidirectional text
Some kaomoji in the wild mix scripts and invisible right-to-left marks. Unicode can reorder glyphs for display. That is why the source list warns you to paste at your own risk. Browsers follow the Unicode Bidirectional Algorithm; your editor might not.
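If you want to know whether pasted text carries invisible direction controls, you can scan for them. A minimal sketch; the helper names are assumptions, and the character class covers the explicit bidi formatting characters (marks, embeddings, overrides, and isolates).

```javascript
// ALM, LRM, RLM, LRE..RLO, LRI..PDI
const BIDI_CONTROLS = /[\u061C\u200E\u200F\u202A-\u202E\u2066-\u2069]/g;

// Hypothetical helper: report where bidi controls occur.
function findBidiControls(text) {
  return [...text.matchAll(BIDI_CONTROLS)].map((m) => ({
    index: m.index,
    codePoint: "U+" + m[0].codePointAt(0).toString(16).toUpperCase().padStart(4, "0"),
  }));
}

// Hypothetical helper: remove them entirely.
function stripBidiControls(text) {
  return text.replace(BIDI_CONTROLS, "");
}

const risky = "abc\u202Edef"; // contains RIGHT-TO-LEFT OVERRIDE
console.log(findBidiControls(risky)); // [{ index: 3, codePoint: "U+202E" }]
console.log(stripBidiControls(risky)); // "abcdef"
```

Stripping is a blunt tool: legitimate right-to-left text can rely on these characters, so reporting them is often the gentler choice.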
Cheatsheet
| You want to… | Use |
|---|---|
| Count code points in JS | [...str].length or Array.from(str).length |
| Count user-visible characters | Intl.Segmenter with grapheme |
| Compare “same looking” strings | a.normalize("NFC") === b.normalize("NFC") |
| Bytes for UTF-8 API | new TextEncoder().encode(str) |
| Declare page encoding | <meta charset="utf-8"> |
| Copy text from the web | navigator.clipboard.writeText() or selection + clipboard |
Further reading
- Unicode Standard — the specification
- MDN: JavaScript String
- MDN: Encoding API