Character Encodings Cheat Sheet

Character encoding defines how characters are represented as bytes. Understanding encodings is critical for web applications, APIs, databases, files, internationalization, and security.

Core Concepts

Term	Description
Character	Human-readable symbol (A, ç, 中, 🙂)
Code Point	Numeric identifier for a character
Encoding	Mapping from code points to bytes
Charset	Historically a coded character set; in HTTP and HTML it is used interchangeably with encoding
Glyph	Visual representation of a character

ASCII

ASCII (American Standard Code for Information Interchange)

7-bit encoding
128 characters
English-only

Range	Description
0–31	Control characters
32–126	Printable characters
127	DEL

Example:

A → 65 → 0x41

⚠️ Cannot represent accented letters or non-Latin scripts (ç, 中, 🙂).
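
As a minimal sketch of the mapping above, Python's built-in `ord` and the `ascii` codec show both the A → 65 → 0x41 example and the failure on a non-ASCII character. The variable names are illustrative only.

```python
# Character -> code -> hex, matching the A -> 65 -> 0x41 example.
code = ord("A")
hex_code = hex(code)
encoded = "A".encode("ascii")

# Encoding a character outside the 7-bit range fails.
try:
    "ç".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```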

Extended ASCII (Non-Standard)

8-bit (256 characters)
Multiple incompatible variants

Examples:
- ISO-8859-1
- Windows-1252

⚠️ No single standard → common source of bugs.
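
To illustrate why the variants are incompatible, the sketch below decodes the same byte under two of them. It uses only Python's standard codecs; the variable names are hypothetical.

```python
# One byte, two "extended ASCII" interpretations.
raw = bytes([0x80])

as_cp1252 = raw.decode("cp1252")       # Windows-1252: euro sign
as_latin1 = raw.decode("iso-8859-1")   # ISO-8859-1: C1 control character

same_result = as_cp1252 == as_latin1
```

A file saved as Windows-1252 and read as ISO-8859-1 therefore changes meaning without raising any error.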

ISO-8859 Family

Single-byte encodings for specific languages.

Encoding	Language Coverage
ISO-8859-1	Western Europe
ISO-8859-2	Central Europe
ISO-8859-5	Cyrillic
ISO-8859-9	Turkish

Limitations:
- Max 256 characters
- Not multilingual
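
The "not multilingual" limitation can be sketched as follows: a string mixing French and Russian fits in neither ISO-8859-1 nor ISO-8859-5, and the same byte means different letters in each. The helper `can_encode` and the sample text are assumptions made for illustration.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

mixed = "ça и да"  # French and Russian in one string
support = {
    enc: can_encode(mixed, enc)
    for enc in ("iso-8859-1", "iso-8859-5", "utf-8")
}

# The same byte maps to different letters in different family members.
raw = bytes([0xE7])
western = raw.decode("iso-8859-1")
cyrillic = raw.decode("iso-8859-5")
```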

Unicode (Universal Character Set)

Unicode assigns a code point (U+0000 to U+10FFFF) to each character; it does not by itself define a byte encoding.

Format:

U+XXXX (four to six hexadecimal digits)

Examples:

A      → U+0041
ç      → U+00E7
中     → U+4E2D
🙂     → U+1F642

Unicode supports:
- All modern languages
- Symbols & emojis
- Mathematical notation
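
A small sketch of the U+XXXX notation, using Python's `ord` and `chr` to move between characters and code points. The helper name `code_point_label` is hypothetical.

```python
def code_point_label(ch: str) -> str:
    """Format a single character as U+XXXX (at least four hex digits)."""
    return f"U+{ord(ch):04X}"

labels = [code_point_label(ch) for ch in "Aç中🙂"]

# chr() goes the other way: code point -> character.
round_trip = chr(0x4E2D)
```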

UTF-8 (Web Standard)

UTF-8 is the dominant encoding on the web and for data interchange.

Characteristics:
- Variable length (1–4 bytes)
- Backward compatible with ASCII (any ASCII text is already valid UTF-8)
- No endianness issues

Bytes	Range
1	U+0000 – U+007F
2	U+0080 – U+07FF
3	U+0800 – U+FFFF (excluding surrogates U+D800 – U+DFFF)
4	U+10000 – U+10FFFF

Example:

A  → 41
ç  → C3 A7
🙂 → F0 9F 99 82

✅ Recommended default encoding
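
The byte-length table above can be sketched as a hand-written encoder. This is for illustration only, since `str.encode("utf-8")` already does this; the function name `utf8_encode_code_point` is hypothetical.

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8, following the range table."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp <= 0x7F:                       # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    return bytes([                       # 4 bytes: 11110xxx + 3 continuation bytes
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])

# Compare against the built-in codec for the characters used in this sheet.
matches_builtin = all(
    utf8_encode_code_point(ord(ch)) == ch.encode("utf-8") for ch in "Aç中🙂"
)
```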

UTF-16

Variable length (2 or 4 bytes)
Uses surrogate pairs for code points above U+FFFF
Endianness matters

Variants:
- UTF-16LE
- UTF-16BE

Example:

🙂 → D83D DE42

Used in:
- Java
- Windows APIs
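
As a sketch of how the surrogate pair in the example is derived, the function below splits a code point above U+FFFF into its high and low surrogates, and the built-in codecs show the effect of endianness. The name `to_surrogate_pair` is hypothetical.

```python
from typing import Tuple

def to_surrogate_pair(cp: int) -> Tuple[int, int]:
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = cp - 0x10000                 # 20-bit value
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x1F642)

# Same code units, different byte order.
big_endian = "🙂".encode("utf-16-be")
little_endian = "🙂".encode("utf-16-le")
```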

UTF-32

Fixed length (4 bytes)
Simple indexing by code point (not by user-perceived character)
Very memory-inefficient

Example:

A → 00 00 00 41

Rarely used for storage or interchange.
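
The trade-off can be sketched by comparing encoded sizes and by indexing directly into UTF-32 bytes. The helper `code_point_at` and the sample string are assumptions for illustration.

```python
text = "A🙂"
utf32 = text.encode("utf-32-be")

def code_point_at(data: bytes, index: int) -> int:
    """Read the index-th code point from big-endian UTF-32 bytes."""
    start = index * 4                     # fixed width makes the offset trivial
    return int.from_bytes(data[start:start + 4], "big")

# Bytes needed for the same two characters in each encoding.
sizes = {
    enc: len(text.encode(enc))
    for enc in ("utf-8", "utf-16-be", "utf-32-be")
}
```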

Byte Order Mark (BOM)

Optional marker (U+FEFF at the start of a stream) indicating encoding and endianness.

Encoding	BOM
UTF-8	EF BB BF
UTF-16LE	FF FE
UTF-16BE	FE FF
UTF-32LE	FF FE 00 00
UTF-32BE	00 00 FE FF

⚠️ A UTF-8 BOM can break scripts and APIs that do not expect it (for example shebang lines and strict JSON parsers).
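
A minimal sketch of BOM detection based on the table above, using the constants from Python's `codecs` module. The function `sniff_bom` is hypothetical; note that the UTF-32 marks are checked first because the UTF-32LE BOM begins with the UTF-16LE BOM.

```python
import codecs
from typing import Optional, Tuple

# Order matters: FF FE 00 00 (UTF-32LE) starts with FF FE (UTF-16LE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes) -> Tuple[Optional[str], int]:
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

# The "utf-8-sig" codec strips a leading BOM; plain "utf-8" keeps it as U+FEFF.
with_bom = codecs.BOM_UTF8 + b"hi"
kept = with_bom.decode("utf-8")
stripped = with_bom.decode("utf-8-sig")
```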

Normalization Forms

The same visible character may be represented by different sequences of code points.

Example:

é → U+00E9
e + combining ´ → U+0065 U+0301

Normalization forms:

Form	Description
NFC	Canonical composition (recommended for storage and interchange)
NFD	Canonical decomposition
NFKC	Compatibility decomposition, then canonical composition
NFKD	Compatibility decomposition
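
The forms in the table can be sketched with Python's standard `unicodedata` module, using the é example above plus the "ﬁ" ligature (U+FB01) to show the compatibility forms. Variable names are illustrative.

```python
import unicodedata

composed = "\u00e9"        # é as a single code point
decomposed = "e\u0301"     # e followed by combining acute accent

# The two strings render the same but compare unequal until normalized.
equal_raw = composed == decomposed
equal_nfc = unicodedata.normalize("NFC", decomposed) == composed
equal_nfd = unicodedata.normalize("NFD", composed) == decomposed

# Compatibility forms also fold characters such as the "fi" ligature.
ligature = "\ufb01"
nfc_ligature = unicodedata.normalize("NFC", ligature)    # unchanged
nfkc_ligature = unicodedata.normalize("NFKC", ligature)  # plain "fi"
```

Because NFKC and NFKD discard distinctions such as ligatures, they are better suited to search and comparison than to storing original text.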

Encodings on the Web

HTML

<meta charset="UTF-8">

HTTP Header

Content-Type: text/html; charset=UTF-8
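
On the receiving side, the charset parameter has to be read back out of the header. The sketch below does this with the standard library's `email.message.Message`; the function name and the choice of UTF-8 as the fallback are assumptions, not something the header format mandates.

```python
from email.message import Message

def charset_from_content_type(header_value: str, default: str = "utf-8") -> str:
    """Extract the charset parameter from a Content-Type header value.

    Falls back to `default` (an assumption of this sketch) when absent.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset() or default

declared = charset_from_content_type("text/html; charset=UTF-8")
legacy = charset_from_content_type("text/html; charset=ISO-8859-1")
missing = charset_from_content_type("text/html")
```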

Programming Languages

Python

s = "ç"
b = s.encode("utf-8")   # str -> bytes: b'\xc3\xa7'
b.decode("utf-8")       # bytes -> str: 'ç' (str has no decode() in Python 3)

JavaScript

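Strings are sequences of UTF-16 code units
TextEncoder always outputs UTF-8

new TextEncoder().encode("🙂") // Uint8Array [240, 159, 153, 130]
"🙂".length // 2 (UTF-16 code units, not characters)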

Java

import java.nio.charset.StandardCharsets;

byte[] bytes = "ç".getBytes(StandardCharsets.UTF_8); // 0xC3 0xA7

PHP

mb_internal_encoding("UTF-8");

Databases

Database	Encoding
MySQL	utf8mb4
PostgreSQL	UTF8
SQLite	UTF-8 / UTF-16
MongoDB	UTF-8

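⚠️ MySQL utf8 is an alias for utf8mb3 (at most 3 bytes per character), so it cannot store code points above U+FFFF such as emoji.
Use utf8mb4.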
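
A quick way to see which strings are affected is to check for code points that need four bytes in UTF-8. The helper `needs_utf8mb4` is a hypothetical name for this sketch and does not talk to a database.

```python
def needs_utf8mb4(text: str) -> bool:
    """True if text contains a code point above U+FFFF (4 bytes in UTF-8)."""
    return any(ord(ch) > 0xFFFF for ch in text)

bmp_only = needs_utf8mb4("ç中")     # fits in 3 bytes per character
with_emoji = needs_utf8mb4("ok 🙂")  # emoji needs 4 bytes
```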

Security Considerations

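Homoglyph attacks: visually identical characters from different scripts (Latin a vs. Cyrillic а) used to spoof names and domains
Encoding confusion: sender and receiver interpret the same bytes under different encodings, letting payloads slip past filters
Double encoding vulnerabilities: input that is decoded twice (%252e → %2e → .) bypasses validation applied after the first decode
Incorrect normalization: comparing or validating before normalizing lets equivalent strings be treated as different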
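
Two of these issues can be sketched with the standard library. The payload, the sample names, and the helper `scripts_used` are assumptions for illustration; the script check is a crude heuristic based on Unicode character names, not a complete confusables detector.

```python
import unicodedata
from urllib.parse import unquote

# Double encoding: a filter that decodes once still sees an encoded string.
payload = "%252e%252e%252f"
once = unquote(payload)
twice = unquote(once)

# Homoglyphs: Latin "a" (U+0061) vs. Cyrillic "а" (U+0430).
genuine = "paypal"
spoofed = "p\u0430ypal"
differs = genuine != spoofed
# Normalization does not help: the two letters are not equivalent in Unicode.
differs_after_nfkc = unicodedata.normalize("NFKC", spoofed) != genuine

def scripts_used(text: str) -> set:
    """Crude heuristic: first word of each letter's Unicode name."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    }

mixed_scripts = scripts_used(spoofed)
```

A mixed-script result such as the one for `spoofed` is a signal worth flagging for review rather than proof of an attack.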

Best Practices

Default to UTF-8 for storage and interchange
Explicitly declare encoding
Normalize input (NFC) before comparing or storing
Validate encoding at boundaries
Avoid legacy encodings
Use utf8mb4 in MySQL
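
Several of these practices can be combined in one small boundary function: decode strictly as UTF-8 so invalid bytes are rejected, then normalize to NFC. The name `decode_boundary_input` is hypothetical, and rejecting bad input outright (rather than replacing bad bytes) is a design choice of this sketch.

```python
import unicodedata

def decode_boundary_input(raw: bytes) -> str:
    """Decode untrusted bytes as strict UTF-8 and normalize to NFC.

    Raises UnicodeDecodeError on invalid or truncated UTF-8.
    """
    text = raw.decode("utf-8", errors="strict")
    return unicodedata.normalize("NFC", text)

clean = decode_boundary_input("e\u0301".encode("utf-8"))

try:
    decode_boundary_input(b"\xc3")   # truncated two-byte sequence
    rejected = False
except UnicodeDecodeError:
    rejected = True
```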

Common Pitfalls

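Mixing encodings (writing in one encoding, reading in another)
Missing charset headers (clients fall back to guessing)
Assuming ASCII (code breaks on the first non-ASCII byte)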
Copy-paste corruption
BOM-related bugs
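
The classic symptom of mixed encodings is mojibake. The sketch below reproduces it by decoding UTF-8 bytes as ISO-8859-1, then reverses the mistake; the repair only works when no bytes were lost along the way.

```python
original = "ç"
wire = original.encode("utf-8")          # C3 A7

# Reader wrongly assumes ISO-8859-1: each byte becomes its own character.
garbled = wire.decode("iso-8859-1")

# Undo the wrong decode, then decode correctly.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
```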

Summary

Unicode assigns code points to characters
UTF-8 defines how those code points are stored as bytes
UTF-8 is the de facto standard for the web and data interchange
Explicit encoding prevents bugs and vulnerabilities