Utilix knowledge base
Unicode, UTF-8, and Why Encoding Matters for Developers
Published May 1, 2026
Every text file, API response, and database column has a character encoding — a rule that maps bytes to characters. When the encoding used to write data does not match the encoding used to read it, the result is garbled text, broken APIs, and corrupted data.
Unicode: the universal character set
Unicode is a standard that assigns a unique number (called a code point) to every character from every writing system. As of Unicode 15.1, over 149,000 characters have been assigned, covering Latin, Arabic, Chinese, Japanese, Korean, emoji, musical symbols, and more.
Code points are written as U+XXXX in hexadecimal. Examples:
- U+0041 → A
- U+00E9 → é (e with acute accent)
- U+1F600 → 😀
Unicode defines what characters exist and their properties. It does not dictate how they are stored as bytes — that is the job of an encoding.
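As a small illustration of the code point notation, the sketch below uses Python's built-in `ord` and `chr` to convert between characters and code points. The helper name `code_point_label` is made up for this example.

```python
def code_point_label(ch: str) -> str:
    """Format a single character's code point in U+XXXX notation."""
    return f"U+{ord(ch):04X}"

# Escapes are used so the example does not depend on the file's own encoding.
samples = ["A", "\u00e9", "\U0001F600"]   # A, é, 😀
labels = [code_point_label(ch) for ch in samples]

# chr() is the inverse of ord(): code point number -> character.
round_trip = [chr(int(label[2:], 16)) for label in labels]
```

Note that nothing here involves bytes yet: `ord` and `chr` work purely at the code point level.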
UTF-8: the dominant encoding
UTF-8 encodes each Unicode code point as 1–4 bytes, using the following scheme:
| Code point range | Bytes used | Notes |
|---|---|---|
| U+0000 – U+007F | 1 byte | Identical to ASCII |
| U+0080 – U+07FF | 2 bytes | Latin extended, Arabic, Hebrew |
| U+0800 – U+FFFF | 3 bytes | Most CJK characters |
| U+10000 – U+10FFFF | 4 bytes | Emoji, rare scripts |
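The ranges in the table can be checked directly. The following sketch (the function name `utf8_length` is an assumption for illustration) computes the expected byte count from the code point and compares it with what Python's encoder produces.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 uses for a code point, per the table above."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    if code_point <= 0x10FFFF:
        return 4
    raise ValueError("beyond the Unicode range")

# Compare against the real encoder for one character from each range.
checks = {
    cp: (utf8_length(cp), len(chr(cp).encode("utf-8")))
    for cp in (0x41, 0xE9, 0x4E2D, 0x1F600)
}
```

The surrogate range U+D800 to U+DFFF is deliberately not tested: those code points are reserved for UTF-16 and cannot be encoded as UTF-8.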
The key properties of UTF-8:
- ASCII-compatible: any ASCII text is valid UTF-8 without modification
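- Self-synchronising: continuation bytes always have the bit pattern 10xxxxxx, so from any byte offset you can find the start of a character by scanning at most three bytes backward or forward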
- Dominant on the web: over 98% of web pages use UTF-8
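The self-synchronising property follows from the byte layout: every continuation byte has the bit pattern `10xxxxxx`, and no lead byte does. A minimal sketch, assuming the input is valid UTF-8 (the helper name `char_start` is hypothetical):

```python
def char_start(data: bytes, offset: int) -> int:
    """Return the index of the first byte of the character that contains data[offset].

    Assumes data is valid UTF-8; continuation bytes match 0b10xxxxxx.
    """
    while offset > 0 and (data[offset] & 0xC0) == 0x80:
        offset -= 1
    return offset

# "a", "é" (2 bytes), "😀" (4 bytes) -> 7 bytes in total
data = "a\u00e9\U0001F600".encode("utf-8")
starts = [char_start(data, i) for i in range(len(data))]
```

Each offset maps back to the lead byte of its character, which is what lets a decoder recover after landing in the middle of a sequence.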
Common failure modes
Mojibake (garbled text): Happens when UTF-8 bytes are interpreted as another encoding, most commonly Latin-1 (ISO-8859-1). The byte sequence for é in UTF-8 (0xC3 0xA9) reads as Ã© in Latin-1, because each byte is decoded as a separate character.
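The effect is easy to reproduce. This sketch encodes é as UTF-8, decodes the bytes with the wrong codec, and then reverses the mistake; the reversal only works when no bytes were lost along the way.

```python
original = "\u00e9"                         # é
utf8_bytes = original.encode("utf-8")       # two bytes: 0xC3 0xA9

# Wrong decoder: each byte becomes its own Latin-1 character.
garbled = utf8_bytes.decode("latin-1")      # Ã followed by ©

# Undo the damage by re-encoding with the wrong codec, then decoding correctly.
repaired = garbled.encode("latin-1").decode("utf-8")
```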
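Splitting at byte boundaries: Slicing UTF-8 data at an arbitrary byte offset can cut a multi-byte sequence in half, corrupting the character. In Python 3, str objects are sequences of code points, so slicing a str by index never splits a code point (although it can still separate a base character from a following combining mark). In languages that expose raw bytes (C char arrays, Go strings and []byte), slice at character boundaries, by rune or code point, rather than at arbitrary byte offsets.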
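A short sketch of the difference between slicing bytes and slicing text in Python. The helpers `is_valid_utf8` and `truncate_utf8` are illustrative names; the truncation approach assumes the input string is valid, so the only bytes that can be dropped are an incomplete trailing sequence.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut text to at most max_bytes of UTF-8 without leaving a partial character."""
    clipped = text.encode("utf-8")[:max_bytes]
    # errors="ignore" drops the incomplete sequence left at the cut point.
    return clipped.decode("utf-8", errors="ignore")

text = "h\u00e9llo"             # héllo
data = text.encode("utf-8")     # 6 bytes, because é takes two

byte_slice_ok = is_valid_utf8(data[:2])   # cuts é in half
str_slice = text[:2]                      # slices by code point
```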
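Emoji and surrogate pairs: Most emoji lie above U+FFFF, so they require 4 bytes in UTF-8 and a surrogate pair (two 16-bit code units) in UTF-16. Systems limited to the Basic Multilingual Plane cannot store them. This includes UCS-2 APIs and MySQL's legacy utf8 character set (an alias for utf8mb3, which stores at most 3 bytes per character, not utf8mb4). Depending on the SQL mode, MySQL either rejects such a value with an error or truncates it at the first 4-byte character. Always use utf8mb4 in MySQL for full Unicode support.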
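To see what this looks like at the byte level, the sketch below encodes U+1F600 in both UTF-8 and UTF-16 and adds a simple pre-check for characters outside the Basic Multilingual Plane. The function name `has_non_bmp` is an assumption for this example, not part of any database driver.

```python
def has_non_bmp(text: str) -> bool:
    """Return True if text contains any code point above U+FFFF."""
    return any(ord(ch) > 0xFFFF for ch in text)

emoji = "\U0001F600"                              # 😀

utf8_byte_count = len(emoji.encode("utf-8"))
# Big-endian UTF-16 without a BOM: two 16-bit code units (a surrogate pair).
utf16_hex = emoji.encode("utf-16-be").hex()
utf16_units = len(emoji.encode("utf-16-be")) // 2
```

A check like `has_non_bmp` could run before writing to a column that is known to be limited to 3-byte UTF-8, so the failure surfaces in application code rather than as silent truncation.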
UTF-8 vs UTF-16 vs UTF-32
| Encoding | Byte width | Used by |
|---|---|---|
| UTF-8 | Variable (1–4) | Web, Linux, macOS, most APIs |
| UTF-16 | Variable (2 or 4) | Windows internals, Java, JavaScript strings |
| UTF-32 | Fixed (4) | Some processing pipelines, rare in storage |
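The byte widths in the table can be compared on concrete characters. This sketch uses the little-endian codec variants so that no byte order mark is added to the counts; `encoded_sizes` is a hypothetical helper name.

```python
CODECS = ("utf-8", "utf-16-le", "utf-32-le")

def encoded_sizes(text: str) -> dict:
    """Return the encoded byte length of text under each codec."""
    return {codec: len(text.encode(codec)) for codec in CODECS}

ascii_sizes = encoded_sizes("A")
cjk_sizes = encoded_sizes("\u4e2d")          # 中
emoji_sizes = encoded_sizes("\U0001F600")    # 😀
```

UTF-8 is smallest for ASCII-heavy text, UTF-16 is smaller for most CJK characters, and UTF-32 is always 4 bytes per code point.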
Practical rules
- Always specify encoding explicitly — never rely on system locale defaults in file I/O
- Use UTF-8 everywhere unless the platform requires otherwise
- Validate inputs at system boundaries — reject or sanitise invalid byte sequences
- Never assume byte length equals character length: size buffers, column limits, and truncation logic in the unit the system actually counts
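The rules above might be applied in Python roughly as follows. This is a sketch under stated assumptions: the function names are invented for illustration, and rejecting invalid input with `ValueError` is one possible policy (replacing bad bytes with U+FFFD via `errors="replace"` is another).

```python
from pathlib import Path

def save_text(path: Path, text: str) -> None:
    """Write text with an explicit encoding instead of the locale default."""
    path.write_text(text, encoding="utf-8")

def load_text(path: Path) -> str:
    """Read text with the same explicit encoding."""
    return path.read_text(encoding="utf-8")

def decode_or_reject(raw: bytes) -> str:
    """Validate bytes at a system boundary; reject anything that is not UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError(f"invalid UTF-8 at byte {exc.start}") from exc

def lengths(text: str) -> tuple:
    """Return (code point count, UTF-8 byte count), which differ for non-ASCII text."""
    return len(text), len(text.encode("utf-8"))
```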
To send binary data through a text channel, encode it with Base64, which converts arbitrary bytes to a pure ASCII string. Base64 operates on bytes, so text must first be serialised to bytes (for example as UTF-8) before it is encoded.
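A minimal sketch of that two-step process using Python's standard `base64` module. The wrapper names `text_to_base64` and `base64_to_text` are hypothetical.

```python
import base64

def text_to_base64(text: str) -> str:
    """Serialise text to UTF-8 bytes, then Base64-encode to an ASCII string."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")

def base64_to_text(encoded: str) -> str:
    """Reverse the process: Base64-decode to bytes, then decode as UTF-8."""
    return base64.b64decode(encoded).decode("utf-8")

encoded = text_to_base64("\u00e9")       # é -> bytes 0xC3 0xA9 -> Base64
decoded = base64_to_text(encoded)
```

The receiver has to know that the original bytes were UTF-8; Base64 itself carries no encoding information.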