Character Encodings Cheat Sheet

Character encoding defines how characters are represented as bytes. Understanding encodings is critical for web applications, APIs, databases, files, internationalization, and security.


Core Concepts

Term         Description
Character    Human-readable symbol (A, ç, 中, 🙂)
Code Point   Numeric identifier for a character (e.g. U+0041 for A)
Encoding     Mapping from code points to bytes
Charset      Often used as a synonym for encoding (as in HTTP and HTML)
Glyph        Visual representation of a character
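
As a minimal sketch of how these terms relate, the snippet below walks one character through the chain character → code point → bytes using only Python built-ins. The variable names are illustrative.

```python
# Walk one character through the terms above: character -> code point -> bytes.
char = "ç"
code_point = ord(char)              # the character's numeric code point
utf8_bytes = char.encode("utf-8")   # an encoding maps the code point to bytes

print(f"U+{code_point:04X}")        # U+00E7
print(utf8_bytes.hex(" "))          # c3 a7
print(chr(code_point) == char)      # True: the code point maps back to the character
```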

ASCII

ASCII (American Standard Code for Information Interchange)

  • 7-bit encoding
  • 128 characters
  • English-only

Range    Description
0–31     Control characters
32–126   Printable characters
127      DEL

Example:

A → 65 → 0x41

⚠️ Cannot represent accented letters (ç) or non-Latin scripts (中).
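
The limitation is easy to observe in Python: encoding to ASCII fails for any character outside 0–127. The helper name `try_ascii` is hypothetical and only serves the illustration.

```python
from typing import Optional

def try_ascii(text: str) -> Optional[bytes]:
    """Return the ASCII bytes for text, or None if a character is outside 0-127."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError:
        return None

print(try_ascii("A"))                  # b'A'
print(try_ascii("ç"))                  # None
print("A".isascii(), "ç".isascii())    # True False
```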


Extended ASCII (Non-Standard)

  • 8-bit (256 characters)
  • Multiple incompatible variants

Examples:

  • ISO-8859-1
  • Windows-1252

⚠️ No single standard → common source of bugs.
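
To make the incompatibility concrete, here is a small sketch that decodes the same byte with the two variants listed above. It uses Python's codec names `cp1252` (Windows-1252) and `latin-1` (ISO-8859-1).

```python
# The same byte means different things in different "extended ASCII" variants.
raw = b"\x80"

as_cp1252 = raw.decode("cp1252")    # Windows-1252 maps 0x80 to the euro sign
as_latin1 = raw.decode("latin-1")   # ISO-8859-1 maps 0x80 to a C1 control character

print(as_cp1252)          # €
print(repr(as_latin1))    # '\x80'
```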


ISO-8859 Family

Single-byte encodings for specific languages.

Encoding     Language Coverage
ISO-8859-1   Western Europe
ISO-8859-2   Central Europe
ISO-8859-5   Cyrillic
ISO-8859-9   Turkish

Limitations:

  • Max 256 characters
  • Not multilingual
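
The "not multilingual" limitation can be sketched as follows: a string mixing a Western European letter and a Cyrillic letter fits in neither ISO-8859 part, while UTF-8 handles it. The helper `encodable` is a hypothetical name for this example.

```python
def encodable(text: str, encoding: str) -> bool:
    """Report whether every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

mixed = "ç ч"  # Latin c with cedilla plus Cyrillic che
results = {enc: encodable(mixed, enc)
           for enc in ("iso-8859-1", "iso-8859-5", "utf-8")}
print(results)  # {'iso-8859-1': False, 'iso-8859-5': False, 'utf-8': True}
```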


Unicode (Universal Character Set)

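Unicode assigns a code point to each character. How code points become bytes is defined separately by encoding forms such as UTF-8, UTF-16, and UTF-32.

Format:

U+XXXX (4 to 6 hexadecimal digits)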

Examples:

A      → U+0041
ç      → U+00E7
中     → U+4E2D
🙂     → U+1F642

Unicode supports:

  • All modern languages
  • Symbols & emojis
  • Mathematical notation
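
The U+XXXX notation can be reproduced from any character with `ord`, and the standard `unicodedata` module gives the official character name. The function name `describe` is an assumption made for this sketch.

```python
import unicodedata

def describe(char: str) -> str:
    """Format a character as 'U+XXXX NAME' using its code point and Unicode name."""
    return f"U+{ord(char):04X} {unicodedata.name(char)}"

for ch in "Aç中🙂":
    print(ch, describe(ch))
# A U+0041 LATIN CAPITAL LETTER A
# ç U+00E7 LATIN SMALL LETTER C WITH CEDILLA
# 中 U+4E2D CJK UNIFIED IDEOGRAPH-4E2D
# 🙂 U+1F642 SLIGHTLY SMILING FACE
```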


UTF-8 (Web Standard)

UTF-8 is the most widely used encoding.

Characteristics:

  • Variable length (1–4 bytes)
  • ASCII-compatible (ASCII text is already valid UTF-8)
  • No endianness issues

Bytes   Code Point Range
1       U+0000 – U+007F
2       U+0080 – U+07FF
3       U+0800 – U+FFFF
4       U+10000 – U+10FFFF

Example:

A  → 41
ç  → C3 A7
🙂 → F0 9F 99 82

✅ Recommended default encoding
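
The byte-length table above translates directly into bit manipulation. The following is a teaching sketch of a single-code-point encoder, not a replacement for `str.encode`. The function name `utf8_encode_code_point` is hypothetical.

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode one code point by hand, following the 1-4 byte ranges above."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value")
    if cp <= 0x7F:                      # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                     # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

for ch in "Aç🙂":
    print(ch, utf8_encode_code_point(ord(ch)).hex(" ").upper())
# A 41
# ç C3 A7
# 🙂 F0 9F 99 82
```

In real code, `text.encode("utf-8")` does the same job and also handles whole strings.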


UTF-16

  • Variable length (2 or 4 bytes)
  • Uses surrogate pairs
  • Endianness matters

Variants:

  • UTF-16LE
  • UTF-16BE

Example:

🙂 → D83D DE42

Used in:

  • Java
  • Windows APIs
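
The surrogate pair for a code point above U+FFFF can be computed by hand. This sketch shows the arithmetic and how byte order changes the serialized form. The function name `surrogate_pair` is an assumption for illustration.

```python
from typing import Tuple

def surrogate_pair(cp: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = cp - 0x10000                   # 20-bit value
    high = 0xD800 + (offset >> 10)          # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)         # bottom 10 bits
    return high, low

high, low = surrogate_pair(ord("🙂"))
print(f"{high:04X} {low:04X}")              # D83D DE42
print("🙂".encode("utf-16-be").hex(" "))    # d8 3d de 42
print("🙂".encode("utf-16-le").hex(" "))    # 3d d8 42 de
```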


UTF-32

  • Fixed length (4 bytes)
  • Simple indexing
  • Very memory-inefficient

Example:

A → 00 00 00 41 (big-endian)

Rarely used for storage or transmission.
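
A short sketch of the trade-off: every character occupies exactly 4 bytes, so the n-th character sits at byte offset 4 × n, but the total size grows compared with UTF-8. The sample string is arbitrary.

```python
text = "Aç🙂"
utf32 = text.encode("utf-32-be")   # big-endian, no BOM

# Fixed width: character i lives at bytes [4*i, 4*i + 4).
i = 2
third = chr(int.from_bytes(utf32[4 * i:4 * i + 4], "big"))

print(utf32[:4].hex(" "))                    # 00 00 00 41
print(third)                                 # 🙂
print(len(utf32), len(text.encode("utf-8"))) # 12 7
```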


Byte Order Mark (BOM)

Optional marker indicating encoding and endianness.

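Encoding   BOM
UTF-8      EF BB BF
UTF-16LE   FF FE
UTF-16BE   FE FF
UTF-32LE   FF FE 00 00
UTF-32BE   00 00 FE FF

⚠️ A UTF-8 BOM can break scripts and APIs, for example a shebang line that no longer starts the file or a JSON parser that rejects the leading bytes.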
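
A BOM sniffer can be sketched with the constants in Python's `codecs` module. The function name `detect_bom` is hypothetical. The UTF-32 entries are checked first because the UTF-32LE BOM begins with the same two bytes as the UTF-16LE BOM.

```python
import codecs
from typing import Optional

# Longest BOMs first: FF FE 00 00 (UTF-32LE) starts with FF FE (UTF-16LE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def detect_bom(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None

payload = codecs.BOM_UTF8 + b"hi"
print(detect_bom(payload))                 # utf-8
print(repr(payload.decode("utf-8")))       # '\ufeffhi'  (BOM kept as a character)
print(repr(payload.decode("utf-8-sig")))   # 'hi'        (BOM stripped)
```

The `utf-8-sig` codec is a convenient way to accept files with or without a UTF-8 BOM.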


Normalization Forms

The same character can be represented by more than one code point sequence.

Example:

é → U+00E9
e + ´ → U+0065 U+0301 (e followed by a combining acute accent)

Normalization forms:

Form   Description
NFC    Composed (recommended)
NFD    Decomposed
NFKC   Compatibility composed
NFKD   Compatibility decomposed
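
The forms above are available through `unicodedata.normalize` in the Python standard library. This sketch shows that the two spellings of é compare unequal until they are normalized to the same form.

```python
import unicodedata

composed = "\u00e9"      # é as a single code point
decomposed = "e\u0301"   # e followed by a combining acute accent

print(composed == decomposed)                                 # False
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(unicodedata.normalize("NFD", composed) == decomposed)   # True

# Compatibility forms also fold characters such as the "fi" ligature (U+FB01).
print(unicodedata.normalize("NFKC", "\ufb01"))                # fi
```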

Encodings on the Web

HTML

<meta charset="UTF-8">

HTTP Header

Content-Type: text/html; charset=UTF-8
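
On the receiving side, the charset parameter has to be read back out of the header. One way to sketch this with the standard library alone is to reuse the `email.message.Message` parser. The function name and the UTF-8 fallback are assumptions of this example, not part of any standard.

```python
from email.message import Message

def charset_from_content_type(header_value: str, default: str = "utf-8") -> str:
    """Extract the charset parameter from a Content-Type value (lowercased)."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset() or default

print(charset_from_content_type("text/html; charset=UTF-8"))   # utf-8
print(charset_from_content_type("text/html; charset=ISO-8859-1"))  # iso-8859-1
print(charset_from_content_type("application/json"))           # utf-8 (assumed fallback)
```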

Programming Languages

Python

s = "ç"
data = s.encode("utf-8")   # str -> bytes: b'\xc3\xa7'
data.decode("utf-8")       # bytes -> str: 'ç'

JavaScript

  • Strings are sequences of UTF-16 code units
  • TextEncoder always produces UTF-8

new TextEncoder().encode("🙂")  // Uint8Array [240, 159, 153, 130]

Java

import java.nio.charset.StandardCharsets;

byte[] bytes = "ç".getBytes(StandardCharsets.UTF_8);  // 0xC3 0xA7

PHP

mb_internal_encoding("UTF-8");

Databases

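Database     Recommended Encoding
MySQL        utf8mb4
PostgreSQL   UTF8
SQLite       UTF-8 / UTF-16
MongoDB      UTF-8

⚠️ MySQL utf8 ≠ full Unicode. It has historically been an alias for utf8mb3, which stores at most 3 bytes per character and cannot hold code points above U+FFFF such as emoji.
Use utf8mb4.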
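
The 3-byte limit means any character that needs 4 bytes in UTF-8 (code points above U+FFFF) does not fit in a MySQL utf8 column. A pre-insert check could be sketched like this. The function name is hypothetical and no database driver is involved.

```python
def fits_in_mysql_utf8mb3(text: str) -> bool:
    """True if every character encodes to at most 3 UTF-8 bytes (code point <= U+FFFF)."""
    return all(ord(ch) <= 0xFFFF for ch in text)

print(fits_in_mysql_utf8mb3("ç 中"))       # True
print(fits_in_mysql_utf8mb3("hello 🙂"))   # False: 🙂 needs 4 bytes
print(len("🙂".encode("utf-8")))           # 4
```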


Security Considerations

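  • Homoglyph attacks: visually identical characters from different scripts (Latin a vs Cyrillic а) used to spoof names and domains
  • Encoding confusion: bytes validated under one encoding but interpreted under another
  • Double encoding vulnerabilities: input such as %252F that passes a filter and is decoded a second time downstream
  • Incorrect normalization: comparing or validating strings before normalizing them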
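
Two of these issues are easy to demonstrate. The sketch below flags mixed-script strings with a rough heuristic (the first word of each character's Unicode name, which is an assumption of this example and not a complete script detector) and shows how a double-encoded path survives one round of URL decoding.

```python
import unicodedata
from urllib.parse import unquote

def scripts_used(text: str) -> set:
    """Rough heuristic: collect the first word of each letter's Unicode name."""
    return {unicodedata.name(ch, "UNKNOWN").split()[0]
            for ch in text if ch.isalpha()}

genuine = "paypal"
spoofed = "p\u0430ypal"          # second letter is Cyrillic а (U+0430)
print(genuine == spoofed)        # False, although they look the same
print(sorted(scripts_used(spoofed)))   # ['CYRILLIC', 'LATIN']

# Double encoding: one decode leaves an encoded payload, a second decode reveals it.
payload = "%252e%252e%252f"
once = unquote(payload)
twice = unquote(once)
print(once)    # %2e%2e%2f
print(twice)   # ../
```

A filter that inspects `once` for "../" finds nothing, which is why decoding should happen exactly once, before validation.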

Best Practices

  • Default to UTF-8 everywhere
  • Explicitly declare encoding
  • Normalize input (NFC)
  • Validate encoding at boundaries
  • Avoid legacy encodings
  • Use utf8mb4 in MySQL
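
Several of these practices can be combined in one boundary function: decode strictly as UTF-8 so invalid bytes are rejected, then normalize. This is a minimal sketch, and the function name `decode_input` and the choice of NFC are assumptions.

```python
import unicodedata

def decode_input(raw: bytes) -> str:
    """Validate raw bytes as UTF-8 and return NFC-normalized text.

    Raises UnicodeDecodeError if the bytes are not valid UTF-8.
    """
    text = raw.decode("utf-8", errors="strict")
    return unicodedata.normalize("NFC", text)

print(decode_input(b"e\xcc\x81") == "\u00e9")   # True: decomposed é becomes composed é

try:
    decode_input(b"\xff\xfe")                   # not valid UTF-8
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```

The same explicitness applies to files: passing `encoding="utf-8"` to `open()` avoids depending on the platform default.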

Common Pitfalls

  • Mixing encodings
  • Missing charset headers
  • Assuming ASCII
  • Copy-paste corruption
  • BOM-related bugs
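
The first pitfall, mixing encodings, produces the familiar garbled text (mojibake). A short sketch reproduces it by decoding UTF-8 bytes as Windows-1252 and then reverses the damage, which only works when no bytes were lost along the way.

```python
original = "ç"

# Written as UTF-8, but read back as Windows-1252: each byte becomes its own character.
garbled = original.encode("utf-8").decode("cp1252")
print(garbled)             # Ã§

# Reversing the wrong decode recovers the text.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == original)   # True
```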

Summary

  • Unicode defines characters and their code points
  • UTF-8 defines how code points are stored as bytes
  • UTF-8 is the de facto standard for the web and data interchange
  • Explicit encoding prevents bugs and vulnerabilities