This guide covers common monitoring scenarios with ready-to-use SQL queries. Each section starts with the question you’re trying to answer, then provides the query.

Find slow queries

Top queries by tail latency

Use the pre-aggregated query_stats_5m view to find the worst p99 latencies without scanning raw events:
SELECT
    query_id,
    cmd_type,
    countMerge(calls_state) AS calls,
    round(sumMerge(duration_sum_state) / countMerge(calls_state) / 1000, 2) AS avg_ms,
    round(quantilesTDigestMerge(0.95, 0.99)(duration_q_state)[1] / 1000, 2) AS p95_ms,
    round(quantilesTDigestMerge(0.95, 0.99)(duration_q_state)[2] / 1000, 2) AS p99_ms
FROM pg_stat_ch.query_stats_5m
WHERE bucket >= now() - INTERVAL 1 HOUR
GROUP BY query_id, cmd_type
ORDER BY p99_ms DESC
LIMIT 10;

Latency trend for a specific query

After identifying a slow query_id, see how its performance changes over time. This is something pg_stat_statements cannot do since it only stores cumulative counters.
SELECT
    toStartOfFiveMinutes(ts_start) AS bucket,
    count() AS calls,
    round(avg(duration_us) / 1000, 2) AS avg_ms,
    round(quantile(0.95)(duration_us) / 1000, 2) AS p95_ms
FROM pg_stat_ch.events_raw
WHERE query_id = 14460383662181259114  -- replace with your query_id
  AND ts_start > now() - INTERVAL 24 HOUR
GROUP BY bucket
ORDER BY bucket;

Individual slow executions

Find the specific query executions that were slowest:
SELECT
    ts_start,
    duration_us / 1000 AS ms,
    rows,
    db,
    username,
    app,
    substring(query, 1, 200) AS query_preview
FROM pg_stat_ch.events_raw
WHERE duration_us > 1000000  -- > 1 second
  AND ts_start > now() - INTERVAL 1 HOUR
ORDER BY duration_us DESC
LIMIT 20;

Diagnose cache problems

Cache miss outliers

Find individual executions with the most disk reads:
SELECT
    ts_start,
    query_id,
    shared_blks_read,
    shared_blks_hit,
    round(100 * shared_blks_read / (shared_blks_hit + shared_blks_read), 2) AS miss_pct,
    duration_us / 1000 AS ms,
    substring(query, 1, 100) AS query_preview
FROM pg_stat_ch.events_raw
WHERE shared_blks_read > 100
  AND ts_start > now() - INTERVAL 1 HOUR
ORDER BY shared_blks_read DESC
LIMIT 20;

Cache hit ratio trend

Track cache efficiency over time:
SELECT
    toStartOfFiveMinutes(ts_start) AS bucket,
    count() AS queries,
    sum(shared_blks_hit) AS hits,
    sum(shared_blks_read) AS reads,
    round(100 * sum(shared_blks_hit) / (sum(shared_blks_hit) + sum(shared_blks_read)), 2) AS hit_ratio
FROM pg_stat_ch.events_raw
WHERE ts_start > now() - INTERVAL 24 HOUR
  AND (shared_blks_hit + shared_blks_read) > 0
GROUP BY bucket
ORDER BY bucket;
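
The same aggregation doubles as an alerting check when filtered with HAVING. A minimal sketch against the same events_raw columns; the 95% threshold is illustrative:
SELECT
    toStartOfFiveMinutes(ts_start) AS bucket,
    round(100 * sum(shared_blks_hit) / (sum(shared_blks_hit) + sum(shared_blks_read)), 2) AS hit_ratio
FROM pg_stat_ch.events_raw
WHERE ts_start > now() - INTERVAL 24 HOUR
  AND (shared_blks_hit + shared_blks_read) > 0
GROUP BY bucket
HAVING hit_ratio < 95  -- illustrative threshold
ORDER BY bucket;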

Queries spilling to temp files

Non-zero temp_blks_written means a sort or hash spilled to disk because it exceeded work_mem:
SELECT
    query_id,
    count() AS executions,
    round(avg(temp_blks_written), 0) AS avg_temp_blks,
    round(avg(duration_us) / 1000, 2) AS avg_ms,
    any(substring(query, 1, 150)) AS sample_query
FROM pg_stat_ch.events_raw
WHERE temp_blks_written > 0
  AND ts_start > now() - INTERVAL 1 HOUR
GROUP BY query_id
ORDER BY avg_temp_blks DESC
LIMIT 10;
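
Once a spilling query_id is identified, one option is to test a larger work_mem in a single session before changing anything server-wide. A minimal PostgreSQL-side sketch; the value and the statement are placeholders:
SET work_mem = '64MB';  -- session-local, illustrative value
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders ORDER BY created_at;  -- hypothetical statement; substitute the sample_query from above
RESET work_mem;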

Track errors

Errors by SQLSTATE

SELECT
    err_sqlstate,
    count() AS errors,
    uniq(query_id) AS unique_queries,
    any(err_message) AS sample_message
FROM pg_stat_ch.events_raw
WHERE err_elevel >= 21  -- ERROR and above
  AND ts_start > now() - INTERVAL 24 HOUR
GROUP BY err_sqlstate
ORDER BY errors DESC;

Error rate over time

SELECT
    toStartOfFiveMinutes(ts_start) AS bucket,
    count() AS total_queries,
    countIf(err_elevel >= 21) AS errors,
    round(100 * countIf(err_elevel >= 21) / count(), 2) AS error_pct
FROM pg_stat_ch.events_raw
WHERE ts_start > now() - INTERVAL 24 HOUR
GROUP BY bucket
ORDER BY bucket;

Errors for a specific query

SELECT
    ts_start,
    err_sqlstate,
    err_elevel,
    err_message,
    substring(query, 1, 200) AS query_preview
FROM pg_stat_ch.errors_recent
WHERE query_id = 14460383662181259114  -- replace with your query_id
ORDER BY ts_start DESC
LIMIT 20;

Understand load patterns

QPS over time

SELECT
    bucket,
    countMerge(calls_state) / 300 AS qps  -- 300 seconds per 5-minute bucket
FROM pg_stat_ch.query_stats_5m
WHERE bucket >= now() - INTERVAL 24 HOUR
GROUP BY bucket
ORDER BY bucket;
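
To break the rate down further, the 1-minute rollup used in the next section can be grouped the same way. A sketch that assumes db_app_user_1m exposes a db column (implied by its name; adjust if your schema differs):
SELECT
    bucket,
    db,
    countMerge(calls_state) / 60 AS qps  -- 60 seconds per 1-minute bucket
FROM pg_stat_ch.db_app_user_1m
WHERE bucket >= now() - INTERVAL 6 HOUR
GROUP BY bucket, db
ORDER BY bucket, db;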

Load by application

Rank applications by total query time to find the heaviest consumers:
SELECT
    app,
    countMerge(calls_state) AS total_queries,
    round(sumMerge(duration_sum_state) / 1000000, 2) AS total_seconds,
    round(quantilesTDigestMerge(0.95, 0.99)(duration_q_state)[2] / 1000, 2) AS p99_ms,
    sumMerge(errors_sum_state) AS errors
FROM pg_stat_ch.db_app_user_1m
WHERE bucket >= now() - INTERVAL 24 HOUR
GROUP BY app
ORDER BY total_seconds DESC;

Workload breakdown by command type

SELECT
    cmd_type,
    count() AS queries,
    round(100 * count() / sum(count()) OVER (), 1) AS pct,
    round(avg(duration_us) / 1000, 2) AS avg_ms,
    sum(rows) AS total_rows
FROM pg_stat_ch.events_raw
WHERE ts_start > now() - INTERVAL 1 HOUR
GROUP BY cmd_type
ORDER BY queries DESC;

Monitor WAL and write activity

WAL generation rate

SELECT
    toStartOfMinute(ts_start) AS bucket,
    sum(wal_bytes) AS wal_bytes,
    round(sum(wal_bytes) / 1048576, 2) AS wal_mb,
    sum(wal_fpi) AS full_page_images
FROM pg_stat_ch.events_raw
WHERE cmd_type IN ('INSERT', 'UPDATE', 'DELETE')
  AND ts_start > now() - INTERVAL 24 HOUR
GROUP BY bucket
ORDER BY bucket;
Full-page images (FPI) spike after each checkpoint, then drop until the next one. This sawtooth pattern is visible in per-execution data but invisible in pg_stat_statements, which only reports cumulative totals.
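
To make the sawtooth explicit, chart FPI as a share of WAL records per minute. A sketch that assumes events_raw also carries a wal_records counter alongside wal_fpi (adjust if your schema differs):
SELECT
    toStartOfMinute(ts_start) AS bucket,
    sum(wal_fpi) AS fpi,
    round(100 * sum(wal_fpi) / greatest(sum(wal_records), 1), 1) AS fpi_pct
FROM pg_stat_ch.events_raw
WHERE cmd_type IN ('INSERT', 'UPDATE', 'DELETE')
  AND ts_start > now() - INTERVAL 6 HOUR
GROUP BY bucket
ORDER BY bucket;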

Dirty blocks over time

Track buffer write pressure by block type. Spikes in shared blocks indicate write-heavy batches; non-zero local or temp counts indicate temp-table activity or work_mem spills.
SELECT
    toStartOfMinute(ts_start) AS bucket,
    sum(shared_blks_dirtied) AS shared_dirtied,
    sum(local_blks_dirtied) AS local_dirtied,
    sum(temp_blks_written) AS temp_written
FROM pg_stat_ch.events_raw
WHERE ts_start > now() - INTERVAL 24 HOUR
GROUP BY bucket
ORDER BY bucket;

Analyze parallel query usage

Available on PostgreSQL 18+.

Queries where workers couldn’t be launched

SELECT
    query_id,
    cmd_type,
    count() AS executions,
    round(avg(parallel_workers_planned), 1) AS avg_planned,
    round(avg(parallel_workers_launched), 1) AS avg_launched,
    round(avg(parallel_workers_planned - parallel_workers_launched), 1) AS avg_missed
FROM pg_stat_ch.events_raw
WHERE parallel_workers_planned > parallel_workers_launched
  AND ts_start > now() - INTERVAL 24 HOUR
GROUP BY query_id, cmd_type
ORDER BY avg_missed DESC
LIMIT 10;
If avg_missed is consistently high, consider increasing max_parallel_workers or max_worker_processes.
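
A PostgreSQL-side sketch for checking and raising the limits; the value is illustrative, and max_worker_processes only takes effect after a server restart:
SHOW max_parallel_workers;
SHOW max_worker_processes;
ALTER SYSTEM SET max_parallel_workers = 16;  -- illustrative value
SELECT pg_reload_conf();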

Analyze JIT compilation

Queries where JIT overhead exceeds benefit

SELECT
    query_id,
    count() AS executions,
    round(avg(duration_us) / 1000, 2) AS avg_total_ms,
    round(avg(jit_generation_time_us + jit_inlining_time_us +
              jit_optimization_time_us + jit_emission_time_us) / 1000, 2) AS avg_jit_ms,
    round(100 * avg(jit_generation_time_us + jit_inlining_time_us +
                    jit_optimization_time_us + jit_emission_time_us) /
          greatest(avg(duration_us), 1), 1) AS jit_pct_of_total
FROM pg_stat_ch.events_raw
WHERE jit_functions > 0
  AND ts_start > now() - INTERVAL 24 HOUR
GROUP BY query_id
HAVING avg_jit_ms > 10  -- only queries with significant JIT time
ORDER BY jit_pct_of_total DESC
LIMIT 10;
If JIT compilation takes a large percentage of total execution time for frequently-run queries, consider raising jit_above_cost to prevent JIT for those queries.
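
A PostgreSQL-side sketch; the threshold is illustrative (the default is 100000), and app_user is a hypothetical role name:
SHOW jit_above_cost;
ALTER SYSTEM SET jit_above_cost = 500000;  -- illustrative value
SELECT pg_reload_conf();
-- Or turn JIT off for a role that only runs short queries:
ALTER ROLE app_user SET jit = off;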