Character Encodings Cheat Sheet
Character encoding defines how characters are represented as bytes. Understanding encodings is critical for web applications, APIs, databases, files, internationalization, and security.
Core Concepts
| Term | Description |
|---|---|
| Character | Human-readable symbol (A, ç, 中, 🙂) |
| Code Point | Numeric identifier for a character |
| Encoding | Mapping from code points to bytes |
| Charset | Character set; in practice (HTTP, HTML, MIME) used as a synonym for encoding |
| Glyph | Visual representation of a character |
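The terms in the table can be seen side by side in a minimal Python sketch. The sample character is only an illustration; any other character works the same way.

```python
# Walk from character to code point to bytes for one sample character.
char = "\u00e7"                     # the character ç
code_point = ord(char)              # character -> code point (an integer)
encoded = char.encode("utf-8")      # code point -> bytes under one encoding

print(f"U+{code_point:04X}")        # U+00E7
print(encoded.hex(" "))             # c3 a7
print(chr(code_point) == char)      # True
```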
ASCII
ASCII (American Standard Code for Information Interchange)
- 7-bit encoding
- 128 characters
- English-only
| Range | Description |
|---|---|
| 0–31 | Control characters |
| 32–126 | Printable characters |
| 127 | DEL |
Example:
A → 65 → 0x41
⚠️ Cannot represent non-English characters.
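A quick way to check the 7-bit limit is to try encoding as ASCII. The helper name `is_ascii` below is hypothetical and only for illustration; Python 3.7+ also offers the built-in `str.isascii()`.

```python
def is_ascii(text: str) -> bool:
    """Return True if every character fits in 7-bit ASCII (0-127)."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False

print(ord("A"), hex(ord("A")))      # 65 0x41
print(is_ascii("hello"))            # True
print(is_ascii("fa\u00e7ade"))      # False (contains ç)
```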
Extended ASCII (Non-Standard)
- 8-bit (256 characters)
- Multiple incompatible variants
Examples:
- ISO-8859-1
- Windows-1252
⚠️ No single standard, which makes it a common source of bugs.
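The incompatibility is easy to reproduce: the same byte decodes to different characters depending on which variant is assumed. A small sketch using Python's built-in `cp1252` and `latin-1` codecs:

```python
raw = bytes([0x80])

as_cp1252 = raw.decode("cp1252")     # Windows-1252 maps 0x80 to the euro sign
as_latin1 = raw.decode("latin-1")    # ISO-8859-1 maps 0x80 to a C1 control character

print(f"U+{ord(as_cp1252):04X}")     # U+20AC
print(f"U+{ord(as_latin1):04X}")     # U+0080
```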
ISO-8859 Family
Single-byte encodings for specific languages.
| Encoding | Language Coverage |
|---|---|
| ISO-8859-1 | Western Europe |
| ISO-8859-2 | Central Europe |
| ISO-8859-5 | Cyrillic |
| ISO-8859-9 | Turkish |
Limitations:
- Max 256 characters
- Not multilingual
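Because every ISO-8859 part reuses the same 256 byte values, the upper half means something different in each part. The sketch below decodes one byte with two parts of the family; the byte 0xE7 is an arbitrary example.

```python
raw = bytes([0xE7])

# Same byte, different ISO-8859 parts, different characters.
decoded = {name: raw.decode(name) for name in ("iso-8859-1", "iso-8859-5")}

for name, char in decoded.items():
    print(name, f"U+{ord(char):04X}")
# iso-8859-1 U+00E7  (Latin small letter c with cedilla)
# iso-8859-5 U+0447  (Cyrillic small letter che)
```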
Unicode (Universal Character Set)
Unicode defines code points, not byte encoding.
Format:
U+XXXX
Examples:
A → U+0041
ç → U+00E7
中 → U+4E2D
🙂 → U+1F642
Unicode supports:
- All modern languages
- Symbols & emojis
- Mathematical notation
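The U+XXXX notation is just the code point in hexadecimal, padded to at least four digits. A minimal sketch, where `describe` is a hypothetical helper name:

```python
import unicodedata

def describe(char: str) -> str:
    """Format one character as 'U+XXXX NAME' using the Unicode database."""
    return f"U+{ord(char):04X} {unicodedata.name(char, '<unnamed>')}"

# The four sample characters, written as escapes so the source stays ASCII.
for ch in ("A", "\u00e7", "\u4e2d", "\U0001F642"):
    print(describe(ch))
```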
UTF-8 (Web Standard)
UTF-8 is the most widely used encoding.
Characteristics:
- Variable length (1–4 bytes)
- ASCII-compatible
- Backward compatible
- No endianness issues
| Bytes | Range |
|---|---|
| 1 | U+0000 – U+007F |
| 2 | U+0080 – U+07FF |
| 3 | U+0800 – U+FFFF |
| 4 | U+10000 – U+10FFFF |
Example:
A → 41
ç → C3 A7
🙂 → F0 9F 99 82
✅ Recommended default encoding
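The byte-count table maps directly onto bit manipulation. The following is an illustrative sketch of the encoding rule, not a replacement for `str.encode`; the function name `utf8_encode` is hypothetical.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative, not optimized)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point <= 0x7F:                       # 1 byte: 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:                      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:                     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),     # 4 bytes: 11110xxx + 3 continuation bytes
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

for cp in (0x41, 0xE7, 0x4E2D, 0x1F642):
    # Cross-check against Python's built-in encoder.
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
    print(f"U+{cp:04X}", utf8_encode(cp).hex(" "))
```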
UTF-16
- Variable length (2 or 4 bytes)
- Uses surrogate pairs
- Endianness matters
Variants:
- UTF-16LE
- UTF-16BE
Example:
🙂 → D83D DE42
Used in:
- Java
- Windows APIs
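A surrogate pair is derived arithmetically from the code point, and endianness only changes the byte order of each 16-bit unit. A minimal sketch, with `to_surrogates` as a hypothetical helper name:

```python
def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split a code point in U+10000..U+10FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low

high, low = to_surrogates(0x1F642)
print(f"{high:04X} {low:04X}")                       # D83D DE42
print("\U0001F642".encode("utf-16-be").hex(" "))     # d8 3d de 42
print("\U0001F642".encode("utf-16-le").hex(" "))     # 3d d8 42 de
```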
UTF-32
- Fixed length (4 bytes)
- Simple indexing
- Very memory-inefficient
Example:
A → 00 00 00 41
Rarely used in practice.
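The memory cost is easy to measure by encoding the same text in each UTF form. A small sketch, where `encoded_sizes` is a hypothetical helper and the BOM-free codec variants are used so that only the payload is counted:

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Byte length of text under each UTF form (BOM-free variants)."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

print(encoded_sizes("hello"))               # {'utf-8': 5, 'utf-16-le': 10, 'utf-32-le': 20}
print("A".encode("utf-32-be").hex(" "))     # 00 00 00 41
```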
Byte Order Mark (BOM)
Optional marker indicating encoding and endianness.
| Encoding | BOM |
|---|---|
| UTF-8 | EF BB BF |
| UTF-16LE | FF FE |
| UTF-16BE | FE FF |
| UTF-32LE | FF FE 00 00 |
| UTF-32BE | 00 00 FE FF |
⚠️ UTF-8 BOM can break scripts and APIs.
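BOM sniffing can be sketched with the constants in Python's `codecs` module. The function name `sniff_bom` is hypothetical; note that the UTF-32LE BOM must be checked before UTF-16LE because it starts with the same two bytes.

```python
import codecs

# Longest BOMs first: the UTF-32LE BOM begins with the UTF-16LE BOM bytes.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes):
    """Return (encoding, payload_without_bom), or (None, data) if no BOM is present."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, data[len(bom):]
    return None, data

print(sniff_bom(b"\xef\xbb\xbfhi"))               # ('utf-8', b'hi')
print(b"\xef\xbb\xbfhi".decode("utf-8-sig"))      # hi
```

Decoding with plain `"utf-8"` keeps the BOM as a leading U+FEFF character, which is one way it ends up breaking parsers; the `"utf-8-sig"` codec strips it.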
Normalization Forms
Unicode characters may have multiple representations.
Example:
é → U+00E9
e + ´ → U+0065 U+0301 (e followed by a combining acute accent)
Normalization forms:
| Form | Description |
|---|---|
| NFC | Composed (recommended) |
| NFD | Decomposed |
| NFKC | Compatibility composed |
| NFKD | Compatibility decomposed |
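The four forms are available through Python's standard `unicodedata.normalize`. A short sketch showing that the two representations of é compare unequal until normalized:

```python
import unicodedata

composed = "\u00e9"          # é as a single code point
decomposed = "e\u0301"       # e followed by COMBINING ACUTE ACCENT

print(composed == decomposed)                                  # False
print(unicodedata.normalize("NFC", decomposed) == composed)    # True
print(unicodedata.normalize("NFD", composed) == decomposed)    # True
print(unicodedata.normalize("NFKC", "\ufb01"))                 # fi (ligature folded to two letters)
```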
Encodings on the Web
HTML
<meta charset="UTF-8">
HTTP Header
Content-Type: text/html; charset=UTF-8
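On the receiving side, the charset parameter has to be parsed out of the header before decoding the body. One possible sketch uses the standard library's `email.message.Message`; the helper name `charset_from_content_type` and the UTF-8 fallback are assumptions for illustration, not a rule from any specification.

```python
from email.message import Message

def charset_from_content_type(header_value: str, default: str = "utf-8") -> str:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset in lowercase, or None if absent.
    return msg.get_content_charset() or default

print(charset_from_content_type("text/html; charset=UTF-8"))        # utf-8
print(charset_from_content_type("text/html; charset=ISO-8859-1"))   # iso-8859-1
print(charset_from_content_type("text/html"))                       # utf-8 (assumed fallback)
```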
Programming Languages
Python
s = "ç"
b = s.encode("utf-8")   # str -> bytes: b'\xc3\xa7'
b.decode("utf-8")       # bytes -> str: 'ç' (str itself has no decode() in Python 3)
JavaScript
- Uses UTF-16 internally
- UTF-8 for I/O
new TextEncoder().encode("🙂")
Java
"ç".getBytes(StandardCharsets.UTF_8);
PHP
mb_internal_encoding("UTF-8");
Databases
| Database | Encoding |
|---|---|
| MySQL | utf8mb4 |
| PostgreSQL | UTF8 |
| SQLite | UTF-8 / UTF-16 |
| MongoDB | UTF-8 |
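⚠️ MySQL utf8 ≠ full Unicode: it is an alias for utf8mb3, which stores at most 3 bytes per character and cannot hold code points above U+FFFF (such as most emoji).
Use utf8mb4.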
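Whether a given string would be affected can be checked before it reaches the database. A minimal sketch, with `needs_utf8mb4` as a hypothetical helper name: any code point above U+FFFF takes 4 bytes in UTF-8 and therefore does not fit a 3-byte character set.

```python
def needs_utf8mb4(text: str) -> bool:
    """True if any character takes 4 bytes in UTF-8 (code point above U+FFFF)."""
    return any(ord(ch) > 0xFFFF for ch in text)

print(needs_utf8mb4("caf\u00e9 \u4e2d"))      # False: all characters fit in 3 bytes
print(needs_utf8mb4("hello \U0001F642"))      # True: the emoji needs 4 bytes
```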
Security Considerations
- Homoglyph attacks
- Encoding confusion
- Double encoding vulnerabilities
- Incorrect normalization
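Two of these issues can be illustrated in a few lines. The sketch below is a crude demonstration only: `scripts_used` is a hypothetical heuristic based on Unicode character names, not a real confusable-detection algorithm, and the strings are made-up examples.

```python
import unicodedata
from urllib.parse import unquote

def scripts_used(text: str) -> set[str]:
    """Crude heuristic: first word of each letter's Unicode name (LATIN, CYRILLIC, ...)."""
    return {unicodedata.name(ch, "UNKNOWN").split()[0] for ch in text if ch.isalpha()}

# Homoglyphs: the second string uses CYRILLIC SMALL LETTER A (U+0430) instead of Latin "a".
latin = "paypal"
spoof = "p\u0430yp\u0430l"
print(latin == spoof)                 # False, although both render alike
print(sorted(scripts_used(spoof)))    # ['CYRILLIC', 'LATIN']

# Double encoding: a filter that decodes once still sees percent-escapes, not "../".
payload = "%252e%252e%252f"
once = unquote(payload)
twice = unquote(once)
print(once)                           # %2e%2e%2f
print(twice)                          # ../
```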
Best Practices
- Always use UTF-8
- Explicitly declare encoding
- Normalize input
- Validate encoding at boundaries
- Avoid legacy encodings
- Use utf8mb4 in MySQL databases
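Several of these practices can be combined at the point where bytes enter the application. A minimal sketch, assuming input is expected to be UTF-8 and that NFC is the chosen normalization form; `decode_input` is a hypothetical name.

```python
import unicodedata

def decode_input(raw: bytes) -> str:
    """Decode untrusted bytes as strict UTF-8, then normalize to NFC."""
    # errors="strict" raises UnicodeDecodeError instead of silently replacing bad bytes.
    text = raw.decode("utf-8", errors="strict")
    return unicodedata.normalize("NFC", text)

print(decode_input(b"e\xcc\x81") == "\u00e9")   # True: decomposed input becomes composed é

try:
    decode_input(b"\xff\xfe")                   # not valid UTF-8
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```

For files, the same idea applies by always passing an explicit encoding, for example `open(path, encoding="utf-8")`, rather than relying on the platform default.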
Common Pitfalls
- Mixing encodings
- Missing charset headers
- Assuming ASCII
- Copy-paste corruption
- BOM-related bugs
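Mixing encodings typically shows up as mojibake: text encoded as UTF-8 but decoded with a single-byte encoding. A short sketch that reproduces the effect and, when no bytes were lost, reverses it:

```python
original = "\u00e7"                                      # ç

# Written as UTF-8 (C3 A7) but read back as ISO-8859-1: each byte becomes its own character.
garbled = original.encode("utf-8").decode("latin-1")
print(len(original), len(garbled))                       # 1 2
print(garbled == "\u00c3\u00a7")                         # True: displays as "Ã§"

# Reversing the wrong step recovers the text, as long as nothing was dropped along the way.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)                              # True
```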
Summary
- Unicode assigns a code point to every character
- UTF-8 defines how those code points are stored as bytes
- UTF-8 is the de facto standard for the web and data interchange
- Explicit encoding prevents bugs and vulnerabilities