Skip to content

ClickHouse Cheat Sheet

ClickHouse is a column-oriented OLAP database optimized for fast analytical queries over large datasets. This cheatsheet focuses on production patterns: modeling with MergeTree, query/ingest patterns, and performance/ops.


CLI & Connection

clickhouse-client

clickhouse-client
clickhouse-client --host <host> --port 9000 --user <user> --password
clickhouse-client --query "SELECT version();"

HTTP interface (common for integrations)

curl -sS "http://localhost:8123/?query=SELECT%201"

Database & DDL Basics

SHOW DATABASES;
CREATE DATABASE IF NOT EXISTS analytics;
USE analytics;

SHOW TABLES;
DESCRIBE TABLE analytics.events;
SHOW CREATE TABLE analytics.events;
DROP TABLE IF EXISTS analytics.events;

Data Types (Practical + Performance)

Numeric

  • Int8/16/32/64, UInt8/16/32/64
  • Float32/64
  • Decimal(p, s) for money

Date/Time

  • Date, Date32
  • DateTime, DateTime64(3) (ms precision)

Strings

  • String (binary-safe)
  • FixedString(N) (rare; fixed-length)
  • LowCardinality(String) for repeated values (often a win for dimensions)

Complex

  • Array(T), Tuple(...), Map(K, V), Nested(...)
  • UUID
  • Enum8/Enum16 for compact categorical values
  • Nullable(T) (avoid where possible for performance; prefer defaults)

MergeTree Modeling (The Important Part)

Core ideas (how ClickHouse “indexes”)

For MergeTree-family engines, ClickHouse stores data in parts, sorted by a key. The main query acceleration mechanism is: - sorting key (ORDER BY) + sparse index marks - partition pruning (PARTITION BY) - optional data skipping indexes, projections, and PREWHERE

Minimal MergeTree table

CREATE TABLE analytics.events
(
  event_date Date,
  ts DateTime64(3),
  user_id UInt64,
  event_type LowCardinality(String),
  url String,
  props String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, event_type, user_id, ts);

ORDER BY vs PRIMARY KEY

  • ORDER BY defines the sorting key (how data is physically ordered).
  • PRIMARY KEY defines the primary index key (sparse index) and is often the same as ORDER BY. If omitted, many setups effectively use the sorting key as the primary key.

A common pattern: - Put the most selective and frequently filtered columns early in ORDER BY - Keep ORDER BY stable; changing it later is expensive

Partitioning

Partitioning helps with: - pruning old data fast - faster drops/retention - keeping merges manageable

Patterns:

PARTITION BY toYYYYMM(event_date)        -- time series (monthly)
PARTITION BY toDate(ts)                  -- daily partitions
PARTITION BY (tenant_id, toYYYYMM(ts))   -- multi-tenant

TTL (Retention / Tiering)

ALTER TABLE analytics.events
  MODIFY TTL event_date + INTERVAL 90 DAY DELETE;

-- tiering (example volumes vary by config)
ALTER TABLE analytics.events
  MODIFY TTL event_date + INTERVAL 7 DAY TO VOLUME 'cold';

Compression codecs (per-column)

CREATE TABLE analytics.metrics
(
  ts DateTime,
  host LowCardinality(String),
  cpu Float32 CODEC(ZSTD(3)),
  mem Float32 CODEC(ZSTD(3)),
  msg String CODEC(ZSTD(6))
)
ENGINE = MergeTree
ORDER BY (host, ts);

Inserts & Ingest Patterns

Basic insert

INSERT INTO analytics.events (event_date, ts, user_id, event_type, url, props)
VALUES ('2025-12-27', now64(3), 42, 'page_view', '/home', '{}');

Insert from SELECT

INSERT INTO analytics.events
SELECT
  toDate(ts) AS event_date,
  ts, user_id, event_type, url, props
FROM staging.raw_events;

Bulk inserts with formats

# JSONEachRow
cat events.jsonl | clickhouse-client --query="INSERT INTO analytics.events FORMAT JSONEachRow"

# CSV
cat events.csv | clickhouse-client --query="INSERT INTO analytics.events FORMAT CSV"

Insert settings

  • Prefer bigger batches for throughput
  • Consider asynchronous inserts (if enabled in your deployment) for spiky writes

SELECT Patterns (ClickHouse-style)

LIMIT BY (top-N per group)

SELECT *
FROM analytics.events
ORDER BY ts DESC
LIMIT 5 BY user_id;

PREWHERE (read fewer columns sooner)

SELECT user_id, ts, url
FROM analytics.events
PREWHERE event_date >= today() - 7
WHERE event_type = 'page_view';

SAMPLE (requires SAMPLE BY in table definition)

SELECT count()
FROM analytics.events
SAMPLE 0.1;

FINAL (forces merge of collapsing/versioned engines; expensive)

SELECT *
FROM analytics.some_collapsing_table FINAL
WHERE ...
;

Filtering & Functions (Useful “Advanced”)

Conditional aggregation

SELECT
  event_date,
  count() AS total,
  countIf(event_type = 'purchase') AS purchases,
  uniqExactIf(user_id, event_type = 'purchase') AS buyers
FROM analytics.events
WHERE event_date >= today() - 30
GROUP BY event_date
ORDER BY event_date;

Arrays

SELECT
  user_id,
  arrayDistinct(groupArray(event_type)) AS seen_events
FROM analytics.events
GROUP BY user_id
LIMIT 100;

Window functions

SELECT
  user_id,
  ts,
  row_number() OVER (PARTITION BY user_id ORDER BY ts) AS rn
FROM analytics.events;

Joins (Choose the right tool)

Standard JOIN (mind memory)

SELECT e.user_id, e.ts, u.country
FROM analytics.events e
JOIN analytics.users u ON e.user_id = u.user_id
WHERE e.event_date >= today() - 7;

Join alternatives

  • Dictionaries for small/medium dimension lookups (often faster, less memory pressure)
  • Pre-aggregate / denormalize into fact tables
  • Use ANY joins when appropriate for one-to-one semantics

Aggregation & Rollups

SELECT
  event_date,
  event_type,
  count()
FROM analytics.events
GROUP BY
  event_date,
  event_type
WITH ROLLUP;

Indexing in ClickHouse (What actually helps)

1) Sorting key (ORDER BY) is the “main index”

Design it for your most frequent query patterns.

2) Data skipping indexes (secondary)

Use when: - filters hit a column not early in the sorting key - values are selective enough

Common types you’ll see: - minmax (range pruning for numeric/date) - set (membership) - bloom_filter / tokenbf_v1 / ngrambf_v1 (strings/search-ish)

Example:

ALTER TABLE analytics.events
  ADD INDEX idx_url_bf url TYPE bloom_filter(0.01) GRANULARITY 4;

ALTER TABLE analytics.events
  MATERIALIZE INDEX idx_url_bf;

3) Projections (pre-computed alternative layouts)

Useful when a subset of queries repeatedly needs a different ordering/aggregation.

ALTER TABLE analytics.events
  ADD PROJECTION p_by_type
  (
    SELECT event_type, event_date, count() AS c
    GROUP BY event_type, event_date
  );

ALTER TABLE analytics.events MATERIALIZE PROJECTION p_by_type;


Materialized Views (Real-time derived tables)

MV into a target table

CREATE TABLE analytics.daily_counts
(
  event_date Date,
  event_type LowCardinality(String),
  c UInt64
)
ENGINE = SummingMergeTree
ORDER BY (event_date, event_type);

CREATE MATERIALIZED VIEW analytics.mv_daily_counts
TO analytics.daily_counts
AS
SELECT
  event_date,
  event_type,
  count() AS c
FROM analytics.events
GROUP BY event_date, event_type;

Notes: - MV triggers on inserts into the source table - Backfilling requires INSERT INTO ... SELECT ... into the source, or manual rebuild


Mutations (UPDATE/DELETE) — Powerful but Heavy

ClickHouse is optimized for append-heavy analytics. UPDATE/DELETE are implemented as mutations and run asynchronously.

ALTER TABLE analytics.events
  UPDATE props = '{"redacted":true}'
  WHERE user_id = 42 AND event_date >= today() - 7;

ALTER TABLE analytics.events
  DELETE WHERE event_type = 'debug';

Monitor:

SELECT *
FROM system.mutations
WHERE database = 'analytics' AND table = 'events'
ORDER BY create_time DESC;


Query Analysis & Performance Tuning

EXPLAIN

EXPLAIN SELECT count() FROM analytics.events WHERE event_date >= today() - 7;

EXPLAIN indexes = 1
SELECT count()
FROM analytics.events
WHERE event_date >= today() - 7 AND event_type = 'purchase';

Query logs

SELECT
  event_time,
  query_duration_ms,
  read_rows,
  read_bytes,
  result_rows,
  memory_usage,
  query
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time >= now() - INTERVAL 1 HOUR
ORDER BY event_time DESC
LIMIT 50;

Parts and merges

SELECT
  partition,
  count() AS parts,
  sum(rows) AS rows,
  formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE database = 'analytics' AND table = 'events' AND active
GROUP BY partition
ORDER BY partition;

Cluster, Sharding, Replication

Distributed table (logical view over shards)

CREATE TABLE analytics.events_dist AS analytics.events
ENGINE = Distributed('my_cluster', 'analytics', 'events', rand());

ReplicatedMergeTree (replicated storage)

CREATE TABLE analytics.events_rep
(
  event_date Date,
  ts DateTime,
  user_id UInt64
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/analytics/events_rep', '{replica}')
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id, ts);

Useful system views:

SELECT * FROM system.clusters;
SELECT * FROM system.replicas WHERE database = 'analytics';


Backups & Restore (SQL-level)

BACKUP TABLE analytics.events TO Disk('backups', 'bk_2025_12_27/');
RESTORE TABLE analytics.events FROM Disk('backups', 'bk_2025_12_27/');

BACKUP TABLE analytics.events PARTITION '202512' TO Disk('backups', 'bk_part/');
RESTORE TABLE analytics.events PARTITION '202512' FROM Disk('backups', 'bk_part/');

RBAC (Users, Roles, Grants)

CREATE USER IF NOT EXISTS analyst IDENTIFIED WITH sha256_password BY 'StrongPass';
CREATE ROLE IF NOT EXISTS readonly;

GRANT SELECT ON analytics.* TO readonly;
GRANT readonly TO analyst;

SHOW GRANTS FOR analyst;

Handy System Tables (Ops Toolkit)

SHOW TABLES FROM system;

SELECT * FROM system.tables WHERE database = 'analytics';
SELECT * FROM system.columns WHERE database = 'analytics' AND table = 'events';

SELECT * FROM system.metrics;
SELECT * FROM system.events;

SELECT * FROM system.processes ORDER BY elapsed DESC;