From ASCII to UTF-8: How Text Learned Every Language
Every character you see on screen has a history. Behind the simplicity of letters, symbols, and emojis lies decades of decisions about how to represent human language in binary.
This is the story of how we got from 128 characters to every writing system on Earth — and why UTF-8 won.
The ASCII era
Early computers used ASCII, created in the 1960s.
ASCII used 7 bits, giving only 128 characters:
- English letters
- Numbers
- Punctuation
That worked for American English but not for the rest of the world.
Different countries created their own extensions: Latin-1, Windows-1252, Shift-JIS, KOI8-R, and dozens more.
The same byte value could represent different characters in different systems.
Text exchange became chaos.
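A short Python sketch illustrates the chaos: the very same byte decodes to three different characters under three legacy encodings.

```python
# One byte, three meanings, depending on which legacy encoding you assume.
raw = bytes([0xD0])

print(raw.decode("latin-1"))  # 'Ð' (Latin capital Eth)
print(raw.decode("koi8_r"))   # 'п' (Cyrillic small pe)
print(raw.decode("cp1251"))   # 'Р' (Cyrillic capital er)
```

Send that byte to a system that guesses the wrong encoding and the reader sees the wrong character, with no error raised anywhere.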
Unicode to the rescue
Unicode was created to fix this by giving every character in every language a unique number, called a code point.
Examples:
A -> U+0041
° -> U+00B0
中 -> U+4E2D
😀 -> U+1F600
But Unicode numbers alone do not define how those numbers are stored in memory.
That's where encodings come in.
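Code points are plain integers, which is easy to verify in Python with `ord()` and `chr()`:

```python
# A code point is just an integer assigned to a character.
assert ord("A") == 0x0041
assert ord("°") == 0x00B0
assert ord("中") == 0x4E2D
assert chr(0x1F600) == "😀"
```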
Why not UTF-32?
UTF-32 is the simplest idea imaginable.
Every character uses 4 bytes.
A = 00 00 00 41
No parsing needed.
But this wastes huge amounts of memory. Most text uses simple characters.
A small log file would instantly become four times larger.
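You can see the cost directly in Python (using the big-endian form to avoid a byte-order mark):

```python
# In UTF-32, every code point occupies exactly 4 bytes.
encoded = "A".encode("utf-32-be")
print(encoded.hex())  # '00000041'

# An ASCII-only string balloons to four times its size.
text = "Hello, world!"
assert len(text.encode("utf-32-be")) == 4 * len(text)
```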
Why not UTF-16?
UTF-16 uses 2 bytes for most characters; characters outside the Basic Multilingual Plane (such as emoji) require 4 bytes, stored as a surrogate pair.
It is more compact than UTF-32 but still inefficient for ASCII-heavy text like:
- Code
- Logs
- Configuration files
- HTML
- JSON
Those dominate the internet.
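A quick check in Python shows where UTF-16 pays off and where it wastes space:

```python
# UTF-16: 2 bytes for characters in the Basic Multilingual Plane,
# 4 bytes (a surrogate pair) for everything beyond it.
assert len("A".encode("utf-16-be")) == 2   # ASCII still costs 2 bytes each
assert len("中".encode("utf-16-be")) == 2  # CJK fits in 2 bytes
assert len("😀".encode("utf-16-be")) == 4  # emoji needs a surrogate pair
```

For ASCII-heavy files, that fixed 2-byte floor doubles the size for no benefit.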
Why UTF-8 won
UTF-8 solved the problem elegantly.
ASCII characters remain exactly the same:
A -> 41
So old systems continue to work.
More complex characters expand only when needed.
Typical English text stays compact and efficient, while still supporting every language.
That design decision made UTF-8 the dominant encoding on the internet.
Today more than 95% of web pages use UTF-8.
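The variable-width design is easy to demonstrate in Python:

```python
# UTF-8 is variable-width: 1 to 4 bytes per code point.
for ch in ["A", "°", "中", "😀"]:
    data = ch.encode("utf-8")
    print(ch, data.hex(), len(data), "bytes")

assert "A".encode("utf-8") == b"\x41"      # 1 byte, identical to ASCII
assert "°".encode("utf-8") == b"\xc2\xb0"  # 2 bytes
assert len("中".encode("utf-8")) == 3      # 3 bytes
assert len("😀".encode("utf-8")) == 4      # 4 bytes
```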
From bytes to pixels: fonts and glyphs
Once a parser decodes bytes into a Unicode code point, the job isn't finished.
The system still needs to draw the character.
That's where fonts come in.
A font contains glyphs.
A glyph is the actual shape used to render a character: the specific outline, curves, and strokes that define how it looks on screen or in print. Every letter, symbol, or emoji you see is a glyph drawn from a font file.
Example flow:
UTF-8 bytes
↓
UTF-8 parser
↓
Unicode code point
↓
Font lookup
↓
Glyph drawing
For the degree symbol:
C2 B0
↓
U+00B0
↓
font lookup
↓
glyph for °
If the font lacks the glyph, the system typically shows a fallback symbol like:
□
or
�
These fallback symbols are the system's way of saying: "I know a character belongs here, but I don't have a shape for it."
The difference between □ and � is subtle but important. The empty box means the font doesn't have the glyph. The replacement character means the bytes themselves were invalid — the parser couldn't even determine which character was intended.
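Python makes the second case easy to reproduce. Valid UTF-8 bytes decode cleanly; an invalid sequence produces U+FFFD, the replacement character, when decoding with `errors="replace"`:

```python
# The degree symbol's bytes decode cleanly...
assert b"\xc2\xb0".decode("utf-8") == "°"

# ...but a stray continuation byte with no leading byte is invalid UTF-8,
# so the parser substitutes U+FFFD (�).
bad = b"\xb0"
assert bad.decode("utf-8", errors="replace") == "\ufffd"
```

The empty box, by contrast, appears later in the pipeline, at font lookup, and never involves the parser at all.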
The invisible pipeline
Every time you read text on a screen, an invisible pipeline runs:
- Raw bytes are stored or transmitted
- A parser decodes them into code points
- A font maps code points to glyphs
- A renderer draws the glyphs
Each step can fail. Bad bytes break the parser. Missing glyphs break the renderer. But when it all works, you see text — and never think about the decades of engineering behind it.
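The whole pipeline can be sketched in a few lines of Python. The `FONT` dict here is a hypothetical stand-in for real glyph lookup, just to make the stages visible:

```python
# A toy version of the text pipeline: bytes -> code points -> glyphs.
# FONT is a hypothetical glyph table standing in for a real font file.
FONT = {0x0041: "glyph:A", 0x00B0: "glyph:°"}

def render(raw: bytes) -> list[str]:
    # Step 2: the parser decodes bytes into code points
    # (invalid bytes become U+FFFD rather than crashing).
    text = raw.decode("utf-8", errors="replace")
    glyphs = []
    for ch in text:
        cp = ord(ch)                      # the character's code point
        glyphs.append(FONT.get(cp, "□"))  # Step 3: font lookup, with a
    return glyphs                         # fallback box for missing glyphs

print(render(b"A\xc2\xb0"))  # ['glyph:A', 'glyph:°']
print(render("中".encode("utf-8")))  # ['□'] -- glyph missing from FONT
```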
From ASCII's 128 characters in the 1960s to Unicode's 150,000+ characters today, the goal has always been the same: let humans read and write in their own language, and let computers handle the rest.
UTF-8 made that possible without breaking the past.