Unicode & UTF-8: Why Encoding Matters

Unicode, UTF-8, and Why Encoding Matters for Developers

Every text file, API response, and database column has a character encoding — a rule that maps bytes to characters. When the encoding used to write data does not match the encoding used to read it, the result is garbled text, broken APIs, and corrupted data.

Unicode: the universal character set

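Unicode is a standard that assigns a unique number (called a code point) to each character it encodes, with the goal of covering every writing system. As of Unicode 15.1, over 149,000 characters are assigned, covering Latin, Arabic, Chinese, Japanese, Korean, emoji, musical symbols, and more. The code space itself runs from U+0000 to U+10FFFF (1,114,112 code points), so most of it is still unassigned.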

Code points are written as U+XXXX in hexadecimal. Examples:

U+0041 → A
U+00E9 → é (e with acute accent)
U+1F600 → 😀

Unicode defines what characters exist and their properties. It does not dictate how they are stored as bytes — that is the job of an encoding.
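The split between code point and encoding can be seen directly in Python. This is a minimal sketch; the helper name describe is made up for illustration and is not part of any library.

```python
def describe(ch: str) -> str:
    """Return the U+XXXX label and the UTF-8 bytes (as hex) for one character."""
    code_point = ord(ch)                      # character -> code point (an integer)
    utf8_hex = ch.encode("utf-8").hex(" ")    # code point -> bytes, via an encoding
    return f"U+{code_point:04X} -> {utf8_hex}"

describe("A")            # 'U+0041 -> 41'
describe("\u00e9")       # 'U+00E9 -> c3 a9'        (e with acute accent)
describe("\U0001F600")   # 'U+1F600 -> f0 9f 98 80' (grinning face emoji)
```

The code point is the same whichever encoding is chosen; only the byte sequence changes.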

UTF-8: the dominant encoding

UTF-8 encodes each Unicode code point as 1–4 bytes, using the following scheme:

Code point range	Bytes used	Notes
U+0000 – U+007F	1 byte	Identical to ASCII
U+0080 – U+07FF	2 bytes	Latin extended, Arabic, Hebrew
U+0800 – U+FFFF	3 bytes	Most CJK characters
U+10000 – U+10FFFF	4 bytes	Emoji, rare scripts
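The four ranges in the table map onto a small amount of bit manipulation. The sketch below encodes a single code point by hand, purely to illustrate the scheme; the function name utf8_encode_code_point is hypothetical, and real code should simply call str.encode("utf-8").

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode one code point to UTF-8 by hand, following the four ranges."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        # Out of range, or a surrogate (reserved for UTF-16, not encodable).
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:       # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])

utf8_encode_code_point(0x41)      # b'A'
utf8_encode_code_point(0xE9)      # b'\xc3\xa9'
utf8_encode_code_point(0x1F600)   # b'\xf0\x9f\x98\x80'
```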

The key properties of UTF-8:

ASCII-compatible: any ASCII text is valid UTF-8 without modification
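Self-synchronising: continuation bytes always have the form 10xxxxxx and lead bytes never do, so from any byte offset you can find the nearest character boundary by scanning at most 3 bytes backward or forward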
Dominant on the web: over 98% of web pages use UTF-8
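Self-synchronisation is easy to demonstrate: continuation bytes all match the bit pattern 10xxxxxx, so a character boundary can be located from any offset. The sketch below steps back to the lead byte; char_start is an illustrative name, and the function assumes its input is already valid UTF-8.

```python
def char_start(data: bytes, offset: int) -> int:
    """Return the index of the first byte of the character containing data[offset].

    Assumes data is valid UTF-8.
    """
    # Step back while we are on a continuation byte (top two bits are 10).
    while offset > 0 and (data[offset] & 0xC0) == 0x80:
        offset -= 1
    return offset

# Byte layout:  61 | c3 a9 | f0 9f 98 80   (a, e-acute, grinning face)
data = "a\u00e9\U0001F600".encode("utf-8")

char_start(data, 0)   # 0: already at the single-byte 'a'
char_start(data, 2)   # 1: second byte of the 2-byte sequence
char_start(data, 6)   # 3: last byte of the 4-byte sequence
```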

Common failure modes

Mojibake (garbled text): Happens when UTF-8 bytes are interpreted as another encoding — most commonly Latin-1 (ISO-8859-1). The byte sequence for é in UTF-8 (0xC3 0xA9) reads as Ã© in Latin-1.
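The é example can be reproduced, and in this simple case reversed, in a few lines. Treat the repair step as a sketch: it only works when the wrongly decoded text still holds every original byte, which is not guaranteed once data has passed through other systems.

```python
original = "\u00e9"                        # é
utf8_bytes = original.encode("utf-8")      # b'\xc3\xa9'

# Wrong: the reader assumes Latin-1, so each byte becomes its own character.
garbled = utf8_bytes.decode("latin-1")     # 'Ã©' (U+00C3, U+00A9)

# Repair: undo the wrong decode, then decode with the right encoding.
repaired = garbled.encode("latin-1").decode("utf-8")
```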

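Splitting at byte boundaries: Slicing UTF-8 data at an arbitrary byte offset can cut a multi-byte sequence in half, corrupting the character. In Python 3, str is a sequence of code points, so slicing by index never splits a UTF-8 sequence (though it can still separate a base character from a combining mark that follows it). In languages that index raw bytes (C, Go strings and []byte), slice only at character boundaries, for example by iterating over runes or code points.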
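The difference between slicing bytes and slicing code points shows up as soon as a cut lands inside a multi-byte character. The sketch below also includes one possible way to truncate to a byte budget; truncate_utf8 is a hypothetical helper, and it relies on errors="ignore" dropping the incomplete trailing sequence, which is only appropriate when the input is known to be valid UTF-8.

```python
text = "caf\u00e9"             # "café": 4 code points
data = text.encode("utf-8")    # 5 bytes: 63 61 66 c3 a9

text[:3]     # 'caf', slicing a str counts code points
data[:4]     # b'caf\xc3', cut in the middle of é; decoding this raises

def truncate_utf8(text: str, max_bytes: int) -> str:
    """Return the longest prefix of text whose UTF-8 form fits in max_bytes."""
    clipped = text.encode("utf-8")[:max_bytes]
    # errors="ignore" discards the incomplete sequence at the end, if any.
    return clipped.decode("utf-8", errors="ignore")

truncate_utf8(text, 4)   # 'caf', the half-written é is dropped
truncate_utf8(text, 5)   # 'café'
```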

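Emoji and surrogate pairs: Characters above U+FFFF, which include most emoji, require 4 UTF-8 bytes and, in UTF-16, a surrogate pair. Systems limited to the Basic Multilingual Plane cannot store them: UCS-2 has no surrogate pairs, and MySQL's legacy utf8 charset (an alias for utf8mb3) stores at most 3 bytes per character. Depending on the SQL mode, MySQL either rejects such a value with an error or truncates it with a warning. Always use utf8mb4 in MySQL for full Unicode support.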
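The surrogate-pair arithmetic and a pre-flight check for characters outside the Basic Multilingual Plane can be sketched as follows. Both function names are illustrative; the check is the kind of guard one might run before writing to a store that only handles 3-byte UTF-8.

```python
def surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogate pairs")
    offset = cp - 0x10000                 # 20 bits remain
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return high, low

def has_non_bmp(text: str) -> bool:
    """True if any character needs 4 UTF-8 bytes (code point above U+FFFF)."""
    return any(ord(ch) > 0xFFFF for ch in text)

emoji = "\U0001F600"
surrogate_pair(ord(emoji))        # (0xD83D, 0xDE00)
len(emoji)                        # 1 code point in Python
len(emoji.encode("utf-16-le"))    # 4 bytes, i.e. two 16-bit code units
has_non_bmp("caf\u00e9")          # False
has_non_bmp("hi " + emoji)        # True
```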

UTF-8 vs UTF-16 vs UTF-32

Encoding	Byte width	Used by
UTF-8	Variable (1–4)	Web, Linux, macOS, most APIs
UTF-16	Variable (2 or 4)	Windows internals, Java, JavaScript strings
UTF-32	Fixed (4)	Some processing pipelines, rare in storage
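The byte widths in the table can be checked by encoding the same characters three ways. This sketch uses the little-endian codec names so that Python does not prepend a byte order mark, which would otherwise add 2 or 4 bytes to the UTF-16 and UTF-32 results; encoded_sizes is a name chosen for illustration.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the encoded length in bytes under UTF-8, UTF-16 and UTF-32."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

encoded_sizes("A")            # {'utf-8': 1, 'utf-16-le': 2, 'utf-32-le': 4}
encoded_sizes("\u00e9")       # {'utf-8': 2, 'utf-16-le': 2, 'utf-32-le': 4}
encoded_sizes("\u4e2d")       # {'utf-8': 3, 'utf-16-le': 2, 'utf-32-le': 4}
encoded_sizes("\U0001F600")   # {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}
```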

Practical rules

Always specify encoding explicitly — never rely on system locale defaults in file I/O
Use UTF-8 everywhere unless the platform requires otherwise
Validate inputs at system boundaries — reject or sanitise invalid byte sequences
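Never assume byte length equals character length: for multi-byte text the UTF-8 byte count exceeds the code point count, so every size limit should state which unit it measures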
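In Python these rules translate roughly as follows. This is a sketch under simple assumptions: the file name sample.txt and the helpers decode_strict and decode_lossy are made up for the example, and whether to reject or replace invalid bytes depends on the application.

```python
import os
import tempfile

def decode_strict(raw: bytes) -> str:
    """Reject invalid input: raises UnicodeDecodeError on malformed UTF-8."""
    return raw.decode("utf-8")  # errors="strict" is the default

def decode_lossy(raw: bytes) -> str:
    """Sanitise invalid input: malformed bytes become U+FFFD."""
    return raw.decode("utf-8", errors="replace")

text = "h\u00e9llo"  # "héllo"

# Pass encoding= explicitly on both write and read instead of trusting the locale.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, "r", encoding="utf-8") as f:
        round_tripped = f.read()

len(text)                    # 5 code points
len(text.encode("utf-8"))    # 6 bytes
decode_lossy(b"ok\xff")      # 'ok\ufffd'
```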

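Encode binary data safely for text channels with Base64: Base64 converts arbitrary bytes to a pure ASCII string. It operates on bytes, not characters, so text must first be serialised to bytes (typically as UTF-8) before it is Base64-encoded, and the decoded bytes must be read back with the same encoding.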
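Because Base64 works on bytes, a text round trip has an explicit encode step on the way in and a decode step on the way out. A minimal sketch using Python's standard base64 module; the two wrapper names are illustrative, and UTF-8 is assumed on both ends.

```python
import base64

def text_to_base64(text: str) -> str:
    """Encode text as UTF-8 bytes, then Base64, returning an ASCII-only str."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")

def base64_to_text(encoded: str) -> str:
    """Reverse of text_to_base64: Base64 -> bytes -> UTF-8 text."""
    return base64.b64decode(encoded).decode("utf-8")

text_to_base64("\u00e9")    # 'w6k=' (from the two bytes c3 a9)
base64_to_text("w6k=")      # 'é'
```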