ClickHouse Cheat Sheet
ClickHouse is a column-oriented OLAP database optimized for fast analytical queries over large datasets. This cheatsheet focuses on production patterns: modeling with MergeTree, query/ingest patterns, and performance/ops.
CLI & Connection
clickhouse-client
clickhouse-client
clickhouse-client --host <host> --port 9000 --user <user> --password
clickhouse-client --query "SELECT version();"
HTTP interface (common for integrations)
curl -sS "http://localhost:8123/?query=SELECT%201"
Database & DDL Basics
SHOW DATABASES;
CREATE DATABASE IF NOT EXISTS analytics;
USE analytics;
SHOW TABLES;
DESCRIBE TABLE analytics.events;
SHOW CREATE TABLE analytics.events;
DROP TABLE IF EXISTS analytics.events;
Data Types (Practical + Performance)
Numeric
Int8/16/32/64,UInt8/16/32/64Float32/64Decimal(p, s)for money
Date/Time
Date,Date32DateTime,DateTime64(3)(ms precision)
Strings
String(binary-safe)FixedString(N)(rare; fixed-length)LowCardinality(String)for repeated values (often a win for dimensions)
Complex
Array(T),Tuple(...),Map(K, V),Nested(...)UUIDEnum8/Enum16for compact categorical valuesNullable(T)(avoid where possible for performance; prefer defaults)
MergeTree Modeling (The Important Part)
Core ideas (how ClickHouse “indexes”)
For MergeTree-family engines, ClickHouse stores data in parts, sorted by a key.
The main query acceleration mechanism is:
- sorting key (ORDER BY) + sparse index marks
- partition pruning (PARTITION BY)
- optional data skipping indexes, projections, and PREWHERE
Minimal MergeTree table
CREATE TABLE analytics.events
(
event_date Date,
ts DateTime64(3),
user_id UInt64,
event_type LowCardinality(String),
url String,
props String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, event_type, user_id, ts);
ORDER BY vs PRIMARY KEY
ORDER BYdefines the sorting key (how data is physically ordered).PRIMARY KEYdefines the primary index key (sparse index) and is often the same asORDER BY. If omitted, many setups effectively use the sorting key as the primary key.
A common pattern:
- Put the most selective and frequently filtered columns early in ORDER BY
- Keep ORDER BY stable; changing it later is expensive
Partitioning
Partitioning helps with: - pruning old data fast - faster drops/retention - keeping merges manageable
Patterns:
PARTITION BY toYYYYMM(event_date) -- time series (monthly)
PARTITION BY toDate(ts) -- daily partitions
PARTITION BY (tenant_id, toYYYYMM(ts)) -- multi-tenant
TTL (Retention / Tiering)
ALTER TABLE analytics.events
MODIFY TTL event_date + INTERVAL 90 DAY DELETE;
-- tiering (example volumes vary by config)
ALTER TABLE analytics.events
MODIFY TTL event_date + INTERVAL 7 DAY TO VOLUME 'cold';
Compression codecs (per-column)
CREATE TABLE analytics.metrics
(
ts DateTime,
host LowCardinality(String),
cpu Float32 CODEC(ZSTD(3)),
mem Float32 CODEC(ZSTD(3)),
msg String CODEC(ZSTD(6))
)
ENGINE = MergeTree
ORDER BY (host, ts);
Inserts & Ingest Patterns
Basic insert
INSERT INTO analytics.events (event_date, ts, user_id, event_type, url, props)
VALUES ('2025-12-27', now64(3), 42, 'page_view', '/home', '{}');
Insert from SELECT
INSERT INTO analytics.events
SELECT
toDate(ts) AS event_date,
ts, user_id, event_type, url, props
FROM staging.raw_events;
Bulk inserts with formats
# JSONEachRow
cat events.jsonl | clickhouse-client --query="INSERT INTO analytics.events FORMAT JSONEachRow"
# CSV
cat events.csv | clickhouse-client --query="INSERT INTO analytics.events FORMAT CSV"
Insert settings
- Prefer bigger batches for throughput
- Consider asynchronous inserts (if enabled in your deployment) for spiky writes
SELECT Patterns (ClickHouse-style)
LIMIT BY (top-N per group)
SELECT *
FROM analytics.events
ORDER BY ts DESC
LIMIT 5 BY user_id;
PREWHERE (read fewer columns sooner)
SELECT user_id, ts, url
FROM analytics.events
PREWHERE event_date >= today() - 7
WHERE event_type = 'page_view';
SAMPLE (requires SAMPLE BY in table definition)
SELECT count()
FROM analytics.events
SAMPLE 0.1;
FINAL (forces merge of collapsing/versioned engines; expensive)
SELECT *
FROM analytics.some_collapsing_table FINAL
WHERE ...
;
Filtering & Functions (Useful “Advanced”)
Conditional aggregation
SELECT
event_date,
count() AS total,
countIf(event_type = 'purchase') AS purchases,
uniqExactIf(user_id, event_type = 'purchase') AS buyers
FROM analytics.events
WHERE event_date >= today() - 30
GROUP BY event_date
ORDER BY event_date;
Arrays
SELECT
user_id,
arrayDistinct(groupArray(event_type)) AS seen_events
FROM analytics.events
GROUP BY user_id
LIMIT 100;
Window functions
SELECT
user_id,
ts,
row_number() OVER (PARTITION BY user_id ORDER BY ts) AS rn
FROM analytics.events;
Joins (Choose the right tool)
Standard JOIN (mind memory)
SELECT e.user_id, e.ts, u.country
FROM analytics.events e
JOIN analytics.users u ON e.user_id = u.user_id
WHERE e.event_date >= today() - 7;
Join alternatives
- Dictionaries for small/medium dimension lookups (often faster, less memory pressure)
- Pre-aggregate / denormalize into fact tables
- Use
ANYjoins when appropriate for one-to-one semantics
Aggregation & Rollups
SELECT
event_date,
event_type,
count()
FROM analytics.events
GROUP BY
event_date,
event_type
WITH ROLLUP;
Indexing in ClickHouse (What actually helps)
1) Sorting key (ORDER BY) is the “main index”
Design it for your most frequent query patterns.
2) Data skipping indexes (secondary)
Use when: - filters hit a column not early in the sorting key - values are selective enough
Common types you’ll see:
- minmax (range pruning for numeric/date)
- set (membership)
- bloom_filter / tokenbf_v1 / ngrambf_v1 (strings/search-ish)
Example:
ALTER TABLE analytics.events
ADD INDEX idx_url_bf url TYPE bloom_filter(0.01) GRANULARITY 4;
ALTER TABLE analytics.events
MATERIALIZE INDEX idx_url_bf;
3) Projections (pre-computed alternative layouts)
Useful when a subset of queries repeatedly needs a different ordering/aggregation.
ALTER TABLE analytics.events
ADD PROJECTION p_by_type
(
SELECT event_type, event_date, count() AS c
GROUP BY event_type, event_date
);
ALTER TABLE analytics.events MATERIALIZE PROJECTION p_by_type;
Materialized Views (Real-time derived tables)
MV into a target table
CREATE TABLE analytics.daily_counts
(
event_date Date,
event_type LowCardinality(String),
c UInt64
)
ENGINE = SummingMergeTree
ORDER BY (event_date, event_type);
CREATE MATERIALIZED VIEW analytics.mv_daily_counts
TO analytics.daily_counts
AS
SELECT
event_date,
event_type,
count() AS c
FROM analytics.events
GROUP BY event_date, event_type;
Notes:
- MV triggers on inserts into the source table
- Backfilling requires INSERT INTO ... SELECT ... into the source, or manual rebuild
Mutations (UPDATE/DELETE) — Powerful but Heavy
ClickHouse is optimized for append-heavy analytics. UPDATE/DELETE are implemented as mutations and run asynchronously.
ALTER TABLE analytics.events
UPDATE props = '{"redacted":true}'
WHERE user_id = 42 AND event_date >= today() - 7;
ALTER TABLE analytics.events
DELETE WHERE event_type = 'debug';
Monitor:
SELECT *
FROM system.mutations
WHERE database = 'analytics' AND table = 'events'
ORDER BY create_time DESC;
Query Analysis & Performance Tuning
EXPLAIN
EXPLAIN SELECT count() FROM analytics.events WHERE event_date >= today() - 7;
EXPLAIN indexes = 1
SELECT count()
FROM analytics.events
WHERE event_date >= today() - 7 AND event_type = 'purchase';
Query logs
SELECT
event_time,
query_duration_ms,
read_rows,
read_bytes,
result_rows,
memory_usage,
query
FROM system.query_log
WHERE type = 'QueryFinish'
AND event_time >= now() - INTERVAL 1 HOUR
ORDER BY event_time DESC
LIMIT 50;
Parts and merges
SELECT
partition,
count() AS parts,
sum(rows) AS rows,
formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE database = 'analytics' AND table = 'events' AND active
GROUP BY partition
ORDER BY partition;
Cluster, Sharding, Replication
Distributed table (logical view over shards)
CREATE TABLE analytics.events_dist AS analytics.events
ENGINE = Distributed('my_cluster', 'analytics', 'events', rand());
ReplicatedMergeTree (replicated storage)
CREATE TABLE analytics.events_rep
(
event_date Date,
ts DateTime,
user_id UInt64
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/analytics/events_rep', '{replica}')
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id, ts);
Useful system views:
SELECT * FROM system.clusters;
SELECT * FROM system.replicas WHERE database = 'analytics';
Backups & Restore (SQL-level)
BACKUP TABLE analytics.events TO Disk('backups', 'bk_2025_12_27/');
RESTORE TABLE analytics.events FROM Disk('backups', 'bk_2025_12_27/');
BACKUP TABLE analytics.events PARTITION '202512' TO Disk('backups', 'bk_part/');
RESTORE TABLE analytics.events PARTITION '202512' FROM Disk('backups', 'bk_part/');
RBAC (Users, Roles, Grants)
CREATE USER IF NOT EXISTS analyst IDENTIFIED WITH sha256_password BY 'StrongPass';
CREATE ROLE IF NOT EXISTS readonly;
GRANT SELECT ON analytics.* TO readonly;
GRANT readonly TO analyst;
SHOW GRANTS FOR analyst;
Handy System Tables (Ops Toolkit)
SHOW TABLES FROM system;
SELECT * FROM system.tables WHERE database = 'analytics';
SELECT * FROM system.columns WHERE database = 'analytics' AND table = 'events';
SELECT * FROM system.metrics;
SELECT * FROM system.events;
SELECT * FROM system.processes ORDER BY elapsed DESC;