In the previous post we looked at how SQLite handles rows that don't fit in a page: it stores the first portion of the overflowing value inline in the B-tree leaf cell and chains the rest through a linked list of overflow pages. The mechanism is simple, predictable, and, if you're not careful about schema design, quietly expensive.
PostgreSQL has the same constraint at its core: a row must fit in a page. Its solution, however, is architecturally different in almost every respect. It compresses values before deciding whether to move them. When it does move them, the entire value goes to a separate storage table, not a linked list. And the cost model at read time is inverted from SQLite's in one critical way.
This post covers how PostgreSQL's mechanism (called TOAST) works under the hood, what it costs, and how it compares to SQLite's approach.
The Constraint
PostgreSQL stores table rows (called tuples) in fixed-size 8KB pages. Every tuple must fit within a single page. Unlike SQLite's 4KB default, PostgreSQL's larger page gives more headroom, but TEXT, BYTEA, JSONB, and other variable-length types can still grow far beyond it. Something has to give.
PostgreSQL doesn't reject the write. It doesn't chain overflow pages either. It applies TOAST (The Oversized-Attribute Storage Technique), which compresses and/or moves large values to a completely separate table. Transparently. At read time it fetches and reconstructs them.
TOAST is not optional. Every PostgreSQL table with any variable-length column automatically has a TOAST table created alongside it. You won't see it in \dt, but it is there.
What TOAST Is
When a value is too large to fit inline, PostgreSQL moves it out-of-line and stores a small pointer in its place. The "out-of-line" destination is a TOAST table: a system-managed heap table with a fixed schema:
| Column | Type | Description |
| ------------ | ------- | -------------------------------------------------- |
| chunk_id | OID | Identifies which large value this chunk belongs to |
| chunk_seq | integer | Ordering of this chunk within the value |
| chunk_data | bytea | Up to 2000 bytes of the actual value |
A single large value is split into chunks of up to 2000 bytes each. Each chunk becomes one row in the TOAST table. The TOAST table has its own heap pages and its own B-tree index on (chunk_id, chunk_seq).
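The chunking arithmetic is simple enough to sketch. This is an illustrative model assuming the ~2000-byte chunk size used in this post (the real constant is slightly smaller to leave room for row headers):

```python
# Illustrative chunk math, assuming the ~2000-byte chunk size described above.
TOAST_CHUNK_SIZE = 2000

def chunk_count(value_len: int) -> int:
    """How many TOAST-table rows a value of value_len bytes occupies."""
    return -(-value_len // TOAST_CHUNK_SIZE)  # ceiling division

print(chunk_count(52_000))  # → 26 chunk rows for a 52KB value
```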
You can find the TOAST table for any user table:
SELECT relname, reltoastrelid::regclass AS toast_table
FROM pg_class
WHERE relname = 'your_table' AND reltoastrelid != 0;
The TOAST Threshold
PostgreSQL applies TOAST when a row's total size would exceed ~2KB. But before moving anything out-of-line, it tries compression first:
- Compress: try to shrink the value with pglz (or lz4 if configured). If the compressed form is below ~2KB, keep it inline in compressed form; no TOAST table involved at all.
- Move out-of-line: if compression doesn't bring it below the threshold, move the entire value to the TOAST table.
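The two steps above can be sketched as a decision function. This is a simplified model, not PostgreSQL's actual code: `zlib` stands in for pglz/lz4, and the threshold is applied per value rather than per tuple as PostgreSQL really does:

```python
# Simplified model of the compress-then-move decision described above.
# zlib is a stand-in for pglz/lz4; the ~2KB threshold is from the text.
import zlib

TOAST_THRESHOLD = 2000

def store_extended(value: bytes):
    """Return a (placement, data) pair per the two steps above."""
    if len(value) <= TOAST_THRESHOLD:
        return ("inline", value)                  # small enough as-is
    compressed = zlib.compress(value)
    if len(compressed) <= TOAST_THRESHOLD:
        return ("inline-compressed", compressed)  # compression saved it
    return ("toast", compressed)                  # whole value goes out-of-line

# A highly compressible 5KB value stays inline in compressed form:
print(store_extended(b"a" * 5000)[0])  # → inline-compressed
```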
This is the first major departure from SQLite. SQLite always stores the first ~4057 bytes of an overflowing value inline and chains the rest. PostgreSQL keeps nothing of the value inline: when it goes out-of-line, the heap tuple holds only an 18-byte TOAST pointer. The whole value, compressed or not, lives in the TOAST table.
SQLite (10KB value):
B-tree leaf cell: [4057 bytes inline] + [pointer → overflow chain]
Overflow page 1: [4092 bytes]
Overflow page 2: [remaining bytes]
PostgreSQL (10KB value, EXTENDED):
Heap tuple: [18-byte TOAST pointer]
TOAST table: [chunk 0: 2000B] [chunk 1: 2000B] [chunk 2: 2000B] [chunk 3: 2000B] [chunk 4: remaining bytes]
The Four TOAST Storage Strategies
PostgreSQL lets you configure per column how its values are handled when large. There are four strategies:
PLAIN
No compression, no out-of-line storage. The value is always stored inline; if the tuple won't fit in a page, the write fails. This is the strategy used for fixed-width types like INTEGER and DATE, which can never grow large. You can force PLAIN onto a variable-length column, but any row that then exceeds the page limit is rejected with an error.
EXTENDED (default for most variable-length types)
Compress first, then move out-of-line if still too large. This is the default for TEXT, BYTEA, JSON, JSONB, and most other variable-length types. PostgreSQL:
- Tries to compress the value.
- If the compressed form is below ~2KB, stores it inline in compressed form.
- If still too large, moves the entire value to the TOAST table.
Most values in practice are handled by EXTENDED without you thinking about it. A 5KB JSON document that compresses to 1.8KB never touches the TOAST table at all.
EXTERNAL
No compression, but move out-of-line when large. The value is stored uncompressed in the TOAST table. This is useful when:
- The value is already compressed (images, video) and pglz won't help.
- You need to run substring operations efficiently. With EXTERNAL, PostgreSQL can fetch only the relevant chunks without decompressing the full value.
ALTER TABLE documents ALTER COLUMN body SET STORAGE EXTERNAL;
-- Reads only the first chunk(s), no decompression needed
SELECT substring(body, 1, 200) FROM documents WHERE id = 42;
MAIN
Compress first, prefer to stay inline even if compressed. Out-of-line storage is a last resort. PostgreSQL will still move the value out-of-line if there's no other way to fit the row, but it will try MAIN columns last when deciding what to evict from the page.
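As a mnemonic, the four strategies reduce to two flags. This is a condensed model of the descriptions above, not PostgreSQL internals:

```python
# The four strategies as (tries_compression, allows_out_of_line) flags.
STRATEGIES = {
    "PLAIN":    (False, False),
    "EXTENDED": (True,  True),
    "EXTERNAL": (False, True),
    "MAIN":     (True,  True),   # out-of-line only as a last resort
}

# Only EXTERNAL allows out-of-line storage while skipping compression:
print([s for s, (c, o) in STRATEGIES.items() if not c and o])  # → ['EXTERNAL']
```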
You can change strategy at any time:
ALTER TABLE events ALTER COLUMN metadata SET STORAGE EXTENDED;
ALTER TABLE events ALTER COLUMN payload SET STORAGE EXTERNAL;
The change applies to future writes only. Existing rows keep their old representation until rewritten, and a no-op rewrite like UPDATE t SET col = col may simply reuse the existing TOAST pointer; forcing re-storage can require actually producing a new value (e.g. col || '') or a tool like pg_repack.
How the TOAST Pointer Works
When a value goes out-of-line, the heap tuple stores an 18-byte TOAST pointer (varattrib_1b_e in the source). It encodes:
- Whether the value is compressed or not
- The OID of the TOAST table
- The chunk_id identifying the value
- The original uncompressed length
- The on-disk length
When PostgreSQL reads a tuple and encounters a TOAST pointer, it scans the TOAST table's B-tree index on (chunk_id, chunk_seq), retrieves all chunks in order, decompresses if needed, and reconstructs the full value before returning it to the query.
Heap Page (main table)
┌──────────────────────────────────────────────────────────────┐
│ Tuple │
│ id: 42 │
│ name: "Alice" │
│ metadata: [TOAST ptr → chunk_id=8817, len=52000] │
└──────────────────────────────────────────────────────────────┘
│
▼
TOAST Table (pg_toast_16384)
┌──────────────────────────────────────────────────────────────┐
│ chunk_id=8817, chunk_seq=0, chunk_data=[2000 bytes] │
│ chunk_id=8817, chunk_seq=1, chunk_data=[2000 bytes] │
│ ... │
│ chunk_id=8817, chunk_seq=25, chunk_data=[remaining bytes] │
└──────────────────────────────────────────────────────────────┘
↑ indexed on (chunk_id, chunk_seq)
A 52KB value that doesn't compress well becomes 26 chunks. Reading it requires one TOAST index lookup plus 26 chunk reads: 27 I/Os beyond the initial heap tuple read, repeated for every row in the result set that has a TOASTed column.
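The read path above can be sketched with the TOAST table modeled as a dict keyed by (chunk_id, chunk_seq). All names here are illustrative, and `zlib` again stands in for pglz/lz4:

```python
# Sketch of detoasting: fetch chunks in order, concatenate, decompress if needed.
import zlib

def detoast(toast_table, chunk_id, n_chunks, compressed):
    """Reassemble one out-of-line value from its chunk rows."""
    parts = [toast_table[(chunk_id, seq)] for seq in range(n_chunks)]
    raw = b"".join(parts)
    return zlib.decompress(raw) if compressed else raw

# Round-trip: a 5120-byte value split into 2000-byte chunks.
value = bytes(range(256)) * 20
chunks = {(8817, i): value[i * 2000:(i + 1) * 2000] for i in range(3)}
print(detoast(chunks, 8817, 3, compressed=False) == value)  # → True
```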
Performance Implications
Queries that don't need the column pay nothing
This is the most important difference from SQLite's model.
In SQLite, the record format stores column values end-to-end. To locate column N, the parser must walk through all preceding columns in order. If column 2 is a 100KB TEXT field with 24 overflow pages, a query selecting only column 5 still follows those 24 overflow pages, because it needs to know where column 2 ends to find where column 3 begins.
PostgreSQL's heap tuple format avoids this entirely. Any column can be accessed without reading the others. When a query doesn't project a TOASTed column, PostgreSQL reads the TOAST pointer, recognizes it as a pointer, and discards it: zero additional I/Os.
-- Assuming metadata is TOASTed (50KB):
-- Triggers TOAST reads: 27 I/Os per row
SELECT id, name, metadata FROM events WHERE id = 42;
-- No TOAST reads at all: 1 heap page I/O
SELECT id, name FROM events WHERE id = 42;
Column ordering in the schema has no effect. A TOASTed column at position 2 in a 10-column table costs nothing when queries only project other columns.
SELECT * is expensive
Since PostgreSQL skips TOAST unless the column is projected, SELECT * forces a TOAST lookup for every TOASTed column in every returned row. On a table with one 50KB TOASTed column returning 10,000 rows, SELECT * does roughly 270,000 I/Os that SELECT id, name does not. This cost doesn't appear clearly labeled in EXPLAIN ANALYZE: the heap scan time looks reasonable, but the TOAST access happens silently on top.
| Scenario                                | Heap I/O | TOAST I/O |
| --------------------------------------- | -------- | --------- |
| Scan 1M rows, project non-TOAST columns | ~1M      | 0         |
| Scan 1M rows, project 50KB TOAST column | ~1M      | ~26M      |
| Point lookup, project non-TOAST columns | 1        | 0         |
| Point lookup, project 50KB TOAST column | 1        | ~27       |
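The per-row numbers fall out of simple arithmetic, assuming the ~2000-byte chunk size and one I/O per chunk plus one TOAST index lookup per value:

```python
# Back-of-envelope TOAST read cost per row, under the assumptions above.
CHUNK_SIZE = 2000

def toast_ios_per_row(value_len: int) -> int:
    chunks = -(-value_len // CHUNK_SIZE)  # ceiling division
    return 1 + chunks                     # index lookup + one read per chunk

per_row = toast_ios_per_row(50 * 1024)
print(per_row)           # → 27 for a 50KB value
print(per_row * 10_000)  # → 270000 across a 10,000-row result set
```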
Transparent decompression has a CPU cost
EXTENDED values that compress below the threshold stay inline (no TOAST table access), but every read still incurs a decompression step, and the same applies to out-of-line values stored compressed. For a value that compressed from 500KB down to 90KB, that decompression cost is measurable.
The LZ4 compression algorithm (available since PostgreSQL 14) is dramatically faster to decompress than the default pglz, at a modest reduction in compression ratio. For read-heavy workloads, lz4 usually wins:
ALTER TABLE events ALTER COLUMN metadata SET COMPRESSION lz4;
Or for the whole session (set it in postgresql.conf to make it the server-wide default):
SET default_toast_compression = lz4;
VACUUM must process the TOAST table too
PostgreSQL uses MVCC: when a row is updated, the old version persists until VACUUM removes it. If the row has TOASTed columns, the old version's TOAST chunks also persist. In tables with frequently updated TOASTed columns, the TOAST table accumulates dead rows alongside live ones. Autovacuum handles both, but if it can't keep up, the TOAST table grows without bound.
You can monitor it:
SELECT
relname,
n_live_tup,
n_dead_tup,
pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_stat_user_tables
WHERE relname LIKE 'pg_toast%';
This is a different failure mode than SQLite. SQLite's fragmentation problem (overflow pages scattering throughout the file) is solved by VACUUM rewriting the file sequentially. PostgreSQL's fragmentation problem (dead TOAST rows accumulating) is solved by VACUUM removing dead rows, but live rows on disk are not reordered. Compacting fragmented live data in PostgreSQL requires CLUSTER or pg_repack.
What You Can Do
Never use SELECT * on tables with TOASTed columns
Audit your queries and project only the columns you need. This is the single highest-impact change and requires no schema work.
-- Fetches 50KB of TOAST per row
SELECT * FROM events WHERE created_at > now() - interval '1 day';
-- Skips TOAST entirely
SELECT id, event_type, created_at FROM events WHERE created_at > now() - interval '1 day';
Use JSONB instead of JSON for large documents
JSON stores raw text. JSONB stores a parsed binary representation that is smaller and compresses significantly better. A document that would be 50KB as raw JSON might be 30KB as JSONB, then compress to 8KB, which is the difference between 25 TOAST chunks and 4. JSONB also enables GIN indexing for containment queries (@>, ?, ?|) without reading the full value.
ALTER TABLE events ALTER COLUMN metadata TYPE JSONB USING metadata::JSONB;
Switch to lz4 for frequently read large values (Postgres 14+)
pglz prioritizes compression ratio. lz4 prioritizes speed. For any workload that reads large TOASTed values frequently, lz4 reduces decompression CPU by an order of magnitude at a modest cost in compression ratio.
ALTER TABLE events ALTER COLUMN metadata SET COMPRESSION lz4;
The change applies to future writes. Existing rows keep pglz until rewritten.
Use EXTERNAL for columns you substring frequently
EXTERNAL disables compression but enables partial chunk retrieval. If your application frequently reads just the first few hundred bytes of a large column, PostgreSQL can fetch only the relevant chunks, with no decompression or full value reconstruction.
ALTER TABLE documents ALTER COLUMN body SET STORAGE EXTERNAL;
SELECT substring(body, 1, 200) FROM documents WHERE id = 42;
This is only worth it when your access pattern genuinely reads partial values. For workloads that read the full value every time, lack of compression means more chunks and more I/O.
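Which chunks a substring actually needs is a simple range calculation. This sketch assumes the ~2000-byte chunk size and SQL's 1-based substring offsets:

```python
# With EXTERNAL (no compression), substring(col, offset, length) only needs
# a contiguous range of chunk_seq values.
CHUNK_SIZE = 2000

def chunks_for_substring(offset: int, length: int) -> range:
    """1-based offset, as in SQL substring(); returns the chunk_seq range needed."""
    first = (offset - 1) // CHUNK_SIZE
    last = (offset - 1 + length - 1) // CHUNK_SIZE
    return range(first, last + 1)

# substring(body, 1, 200) on a 52KB value touches only chunk 0 of 26:
print(list(chunks_for_substring(1, 200)))  # → [0]
```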
Apply the same schema patterns as SQLite
The schema-level mitigations from the SQLite overflow post apply here too: move large columns to a separate table so they can't be accidentally projected, and store actual binary objects in an object store instead of the database. The reasoning is the same: isolate the cost so it's only paid when explicitly needed.
How It All Fits Together
A query that reads a row with a TOASTed column:
Heap scan / index lookup
│
▼
Heap page (1 I/O)
│
│ tuple → [18-byte TOAST ptr: chunk_id=8817, len=52000]
▼
TOAST index lookup on (chunk_id=8817, chunk_seq)
│ (B-tree traversal, likely cached after first access)
▼
TOAST heap reads (26 chunk pages)
│
▼
Decompress if EXTENDED (pglz or lz4)
│
▼
Return reconstructed value to query engine
A query that omits the TOASTed column:
Heap page (1 I/O)
│
│ tuple → sees TOAST ptr, does not dereference it
▼
Return other column values directly
(0 additional I/Os)
The 18-byte TOAST pointer is what enables this selectivity. PostgreSQL reads it on every tuple access, but follows it only when the query needs the value, something SQLite's inline-then-chain design cannot do.
Comparison Summary
| Dimension | SQLite | PostgreSQL |
| ----------------------------- | ----------------------------------- | --------------------------------------------- |
| Page size (default) | 4KB | 8KB |
| Overflow threshold | ~4057 bytes per cell | ~2KB (compress first, then move out-of-line) |
| Inline portion when OOL | First ~4057 bytes stay inline | Nothing; whole value moves out |
| Built-in compression | None | pglz or lz4 (EXTENDED strategy) |
| Storage structure | Linked list of overflow pages | Separate heap table with B-tree index |
| Column selectivity at read | Must parse preceding columns | Skip non-projected columns at zero cost |
| Index overflow threshold | ~1007 bytes (much tighter) | ~2704 bytes (1/3 page); larger entries error |
| Fragmentation mechanism | Overflow pages scatter in file | Dead TOAST rows accumulate (bloat) |
| Defragmentation | VACUUM rewrites file sequentially | VACUUM removes dead rows; CLUSTER reorders |
| Tunable storage strategies | Page size only (global) | Per-column: PLAIN, EXTENDED, EXTERNAL, MAIN |
Lessons Learned
Compression often eliminates TOAST I/O entirely. A 5KB JSON document that compresses to 1.8KB stays inline. No TOAST table is ever consulted. SQLite has no equivalent. This is why PostgreSQL often handles large values more gracefully than SQLite even though the mechanisms look similar on the surface.
SELECT * is expensive in a way that is hard to see. TOAST access does not show up separately in EXPLAIN ANALYZE. The heap scan time looks fine; the silent TOAST overhead does not. Explicit column projections are the most important habit to form when working with tables that have large values.
TOAST table bloat is easy to overlook. The TOAST table is invisible in normal tooling. Heavy updates to TOASTed columns can let it grow to multiples of the main table size. Check it explicitly in pg_stat_user_tables.
lz4 is almost always the right choice for large values on Postgres 14+. pglz dates from the 1990s and prioritizes compression ratio over speed. For read-heavy workloads, lz4's faster decompression is worth the marginally larger compressed size.
PostgreSQL's approach is more sophisticated but harder to reason about. SQLite's overflow is mechanical: cells too large chain to overflow pages, you can calculate the cost exactly. PostgreSQL's TOAST involves compression decisions, per-column strategies, a separate heap table with its own B-tree, and MVCC interactions. The transparency is usually helpful. When something is slow, you have to know to look for it.
References
- SQLite Overflow Pages - When Your Rows Don't Fit - the previous post in this series
- PostgreSQL Documentation - TOAST
- PostgreSQL Documentation - Storage Layout
- PostgreSQL Documentation - ALTER TABLE SET STORAGE
- PostgreSQL Documentation - default_toast_compression
- SQLite File Format - Overflow Pages
- pg_repack - Online table reorg for PostgreSQL
Conclusion
Please reach out to me here for more ideas or improvements.