You migrate a collection from MySQL to MongoDB expecting simpler operations and schema flexibility. Reads are fast at first. Then, as documents grow with nested arrays and embedded objects, some queries start taking ten times longer than expected. The collection isn't huge. The indexes are there. explain() shows the index is being used. What's happening underneath is that MongoDB's storage engine is managing pages, overflow references, reconciliation, and cache pressure in ways that have real costs, and those costs are invisible until you understand the storage model.
This post covers how WiredTiger, MongoDB's default storage engine since version 3.2, actually stores data on disk and in memory. It then compares that model directly to SQLite's fixed-page B-tree, which we covered in detail in the SQLite overflow pages post.
Prerequisites
- Familiarity with the SQLite storage model: fixed pages, B-trees, overflow chains. Read SQLite Overflow Pages - When Your Rows Don't Fit first if you haven't.
- Basic understanding of what BSON is and how MongoDB documents are structured
- Awareness of B-trees and B+ trees as data structures
- Awareness of I/O and cache as performance concerns in databases
SQLite's Model in One Paragraph
SQLite stores everything in a flat array of fixed-size pages, 4KB by default. Each page is a node in a B-tree. Leaf pages hold rows packed as cells from the bottom up. When a row's data exceeds the per-cell threshold (roughly 4057 bytes on a 4KB page), SQLite stores the first portion inline and chains the remainder through a linked list of overflow pages. The entire file, from header to last page, uses the same fixed page size. There is no compression. The on-disk format and the in-memory format are essentially the same: the page cache holds exact copies of on-disk pages, unmodified in structure.
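That predictability is literal: the overflow page count for a row is pure arithmetic. A quick sketch, assuming the ~4057-byte inline threshold above and the 4-byte next-page pointer that leaves 4092 usable bytes on each 4KB overflow page:

```python
import math

PAGE_SIZE = 4096
INLINE_THRESHOLD = 4057            # approx. inline payload limit on a 4KB table leaf page
OVERFLOW_CAPACITY = PAGE_SIZE - 4  # each overflow page spends 4 bytes on a next-page pointer

def sqlite_overflow_pages(row_bytes: int) -> int:
    """Overflow pages SQLite needs for a row with this payload size."""
    if row_bytes <= INLINE_THRESHOLD:
        return 0
    spilled = row_bytes - INLINE_THRESHOLD
    return math.ceil(spilled / OVERFLOW_CAPACITY)

print(sqlite_overflow_pages(2_000))      # 0  -> fits inline
print(sqlite_overflow_pages(52 * 1024))  # 13 -> 49191 spilled bytes / 4092 per page
```

No measurement required: the answer falls out of the file format.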
That model is simple and predictable. WiredTiger is neither.
WiredTiger's Foundations
WiredTiger is a general-purpose key-value storage engine. MongoDB uses it to store each collection as a WiredTiger B-tree, where the key is the document's _id field (serialized as BSON) and the value is the full document serialized as BSON. Indexes are stored as separate WiredTiger B-trees.
Two things set WiredTiger apart from SQLite's storage model from the start:
Variable-size pages. WiredTiger does not use a single fixed page size for the entire database. Internal (non-leaf) pages default to 4KB. Leaf pages default to 32KB. Both are configured per collection rather than globally, and are fixed at collection creation time. The larger default leaf page size reflects WiredTiger's expectation that documents are bigger and more varied than SQLite rows.
Separate in-memory and on-disk formats. SQLite's page cache holds exact copies of on-disk pages. WiredTiger does not. When a page is read from disk into the WiredTiger cache, it is decompressed and transformed into an in-memory representation that is structurally different from what's on disk. When a dirty page needs to be written back, it goes through a process called reconciliation that re-serializes and re-compresses the in-memory state into the on-disk format. This split is fundamental to how WiredTiger achieves its performance characteristics.
The On-Disk Page Format
Every WiredTiger page on disk starts with two headers, then contains a sequence of cells.
The page header is 28 bytes and contains the page type, the number of entries, the page's starting record number, a write generation counter, and the page's in-memory (uncompressed) size, which tells the reader how large a buffer to allocate for decompression. The block header immediately follows and is 12 bytes, containing the on-disk block size, a checksum, and block flags.
After the headers comes the cell data. Each cell encodes one key or one value using a compact variable-length format. A 1-byte cell descriptor encodes the cell type and, for short values, the length inline. Longer values use additional length bytes.
A WiredTiger leaf page for a MongoDB collection looks like this:
┌──────────────────────────────────────────────────────────────────┐
│ Page Header (28 bytes)                                           │
│   type=WT_PAGE_ROW_LEAF, entries=2N, recno, write_gen, mem_size  │
├──────────────────────────────────────────────────────────────────┤
│ Block Header (12 bytes)                                          │
│   disk_size, checksum, flags                                     │
├──────────────────────────────────────────────────────────────────┤
│ Cell 0: key   [_id of document 0, BSON ObjectId, 12 bytes]       │
│ Cell 1: value [full BSON document 0, variable length]            │
│ Cell 2: key   [_id of document 1]                                │
│ Cell 3: value [full BSON document 1]                             │
│ ...                                                              │
│ Cell 2N-2: key   [_id of document N-1]                           │
│ Cell 2N-1: value [full BSON document N-1]                        │
└──────────────────────────────────────────────────────────────────┘
(entire page compressed on disk with Snappy by default)
Unlike SQLite, which packs cells from the bottom of the page upward with a pointer array at the top, WiredTiger writes cells sequentially from the start of the data area. There is no pointer array. To find a specific key within a page, WiredTiger scans cells linearly or uses an in-memory search structure built when the page is loaded into cache.
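To make the cell layout concrete, here is a toy model of sequential key/value cells and the linear scan over them. The real WiredTiger cell descriptor packs type bits and multi-byte lengths; this sketch collapses all of that into a single length byte:

```python
# Toy model of sequential key/value cells: each cell is a 1-byte length
# descriptor followed by the payload. (Real WiredTiger cells are more
# elaborate; this only illustrates the front-to-back layout.)

def pack_cells(pairs):
    out = bytearray()
    for key, value in pairs:
        for item in (key, value):
            out.append(len(item))   # 1-byte length descriptor (items < 256 bytes)
            out += item
    return bytes(out)

def find(page: bytes, target_key: bytes):
    """Linear scan: walk cells front to back, pairing each key with its value."""
    pos = 0
    while pos < len(page):
        klen = page[pos]; key = page[pos + 1:pos + 1 + klen]; pos += 1 + klen
        vlen = page[pos]; val = page[pos + 1:pos + 1 + vlen]; pos += 1 + vlen
        if key == target_key:
            return val
    return None

page = pack_cells([(b"id1", b"doc-one"), (b"id2", b"doc-two")])
print(find(page, b"id2"))  # b'doc-two'
```

Note there is nothing like SQLite's cell pointer array here: without an auxiliary in-memory structure, lookup cost is linear in the number of cells.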
Compression: A Fundamental Difference
WiredTiger compresses pages before writing them to disk. The default compression algorithm is Snappy, which is fast and achieves roughly 2:1 compression on typical BSON data. zlib and zstd are also available for higher compression ratios at a greater CPU cost.
The consequence is that the on-disk size of a page has no fixed relationship to its in-memory size. A 32KB leaf page in the WiredTiger cache might occupy only 14KB on disk. Reading that page from disk means reading 14KB, then decompressing to 32KB in memory. Writing means compressing from 32KB to some smaller size.
SQLite has no equivalent. Its pages are written to disk exactly as they exist in the page cache: 4KB in, 4KB out. What you see on disk is what you get in memory. This makes SQLite's I/O model simpler to reason about but means it cannot reduce storage or I/O volume through compression.
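The asymmetry is easy to demonstrate. Snappy is not in Python's standard library, so this sketch uses zlib as a stand-in compressor; the shape of the result, a page whose on-disk size bears no fixed relationship to its in-memory size, is the same:

```python
import zlib

# A BSON-like page full of repetitive field names compresses well.
page_in_memory = b'{"user_id": 12345, "status": "active", "tags": ["a","b"]}' * 500

# "Writing" the page: compress the in-memory form down to its on-disk form.
page_on_disk = zlib.compress(page_in_memory)

ratio = len(page_in_memory) / len(page_on_disk)
print(f"in memory: {len(page_in_memory)} bytes")
print(f"on disk:   {len(page_on_disk)} bytes (~{ratio:.0f}:1)")
```

Reading reverses the transformation: `zlib.decompress(page_on_disk)` reproduces the in-memory bytes, at CPU cost on every page boundary.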
Compression also changes the calculus for storing large values, which brings us to overflow handling.
Large Document Handling: Overflow in WiredTiger
WiredTiger uses overflow pages for values that are too large to store on a leaf page without making that page unwieldy. The overflow threshold in WiredTiger is configurable but defaults to roughly one-quarter of the maximum leaf page size. For a 32KB leaf page, values larger than approximately 8KB are candidates for overflow storage.
When a value exceeds the threshold, WiredTiger does not store any portion of it inline on the leaf page. The entire value is written to one or more dedicated overflow pages, and the leaf page stores a compact overflow reference: a 12-byte token that encodes the address of the overflow page on disk.
WiredTiger leaf page (32KB max)
┌────────────────────────────────────────────────────────────┐
│ Cell: key [ObjectId] │
│ Cell: value [overflow ref → page offset 0x3A200, len=52KB] │ ← 12 bytes
│ Cell: key [ObjectId] │
│ Cell: value [full BSON document, 4KB] │ ← inline
│ ... │
└────────────────────────────────────────────────────────────┘
│
▼
WiredTiger overflow page
┌────────────────────────────────────────────────────────────┐
│ Page Header + Block Header │
│ Raw value data (52KB uncompressed, ~24KB compressed) │
└────────────────────────────────────────────────────────────┘
This is the opposite of SQLite's behavior. SQLite always stores the first ~4057 bytes of an overflowing cell inline and chains the rest. WiredTiger stores nothing of the overflowing value inline: the leaf page holds only the 12-byte reference and the overflow page holds the complete value.
The practical implication is similar read costs on both sides. In SQLite, reading a row that overflows takes at least two reads: one for the leaf page, which yields the first portion of the data, and one or more for the overflow pages. In WiredTiger, reading a document that overflows also takes at least two reads: one for the leaf page, which yields only the reference, and one for the overflow page holding the complete document. For very large documents spanning multiple overflow pages, WiredTiger chains them much like SQLite's linked list of overflow pages.
MongoDB enforces a 16MB limit on individual documents. This means overflow chains in WiredTiger are bounded: even in the worst case, a document requires at most a few hundred overflow pages. In practice, most documents that trigger overflow are in the tens to hundreds of kilobytes and occupy a single overflow page.
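The thresholds above make page counts easy to estimate. A sketch using this post's stated defaults (32KB leaf pages, overflow at roughly one-quarter of that, the 16MB document cap), and assuming for simplicity that overflow pages hold 32KB of uncompressed data each:

```python
import math

LEAF_PAGE_MAX = 32 * 1024
OVERFLOW_THRESHOLD = LEAF_PAGE_MAX // 4   # ~8KB, per the defaults described above
DOC_LIMIT = 16 * 1024 * 1024              # MongoDB's per-document hard limit

def overflow_pages(doc_bytes: int, overflow_page_size: int = LEAF_PAGE_MAX) -> int:
    """Overflow pages for a document (sizes are pre-compression)."""
    if doc_bytes <= OVERFLOW_THRESHOLD:
        return 0                          # stored inline on the leaf page
    return math.ceil(doc_bytes / overflow_page_size)

print(overflow_pages(4 * 1024))    # 0   -> inline
print(overflow_pages(52 * 1024))   # 2   -> two 32KB overflow pages
print(overflow_pages(DOC_LIMIT))   # 512 -> worst case under the 16MB limit
```

The bound in the worst case is a few hundred pages, and compression typically shrinks the actual on-disk footprint well below these uncompressed figures.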
The In-Memory Format and Reconciliation
This is where WiredTiger diverges most significantly from SQLite.
When SQLite reads a page from disk into its page cache, the in-memory representation is the page itself: the same 4KB block, unmodified. Modifications are made directly to the in-memory page, and the modified page is written out as-is: to the WAL at commit, and back to the main database file at checkpoint.
WiredTiger is different at every step.
When a page is read from disk, it is decompressed and then split into separate in-memory structures. The key-value pairs are unpacked from their compact cell encoding into a format that supports fast in-memory search and update. Specifically, each row on a leaf page becomes a skip list entry in memory, allowing O(log n) search within the page.
On disk (compressed, sequential cells):
┌─────────────────────────────────────────────────────┐
│ [cell: key0][cell: val0][cell: key1][cell: val1]... │
│ (Snappy compressed, 14KB on disk) │
└─────────────────────────────────────────────────────┘
│
│ read + decompress + unpack
▼
In memory (skip list, uncompressed):
┌─────────────────────────────────────────────────────┐
│ WT_ROW entries: │
│ [key0 → val0 + update chain] │
│ [key1 → val1 + update chain] │
│ ... │
│ (32KB+ in memory, skip list indexed) │
└─────────────────────────────────────────────────────┘
Updates to documents do not modify the in-memory value in place. Instead, WiredTiger prepends an update to a linked list hanging off that row's entry in the skip list. The update list is the in-memory MVCC mechanism: each update carries a transaction ID and a timestamp, and reads select the appropriate version by walking the update list.
When the page is eventually written back (during eviction or checkpoint), it goes through reconciliation: WiredTiger walks every row in the in-memory page, selects the committed version visible to the checkpoint, serializes it into compact cell encoding, compresses the result, and writes the final block to disk. The on-disk page that results may be completely different in size and layout from the one that was originally read.
This reconciliation cost is real. A page with many small updates accumulates a long update chain. Reconciliation must walk the entire chain for every row on the page to determine the visible version. This is work that SQLite's WAL model avoids: in SQLite, the WAL contains the complete modified page, and a checkpoint simply copies WAL pages to the main database file.
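A toy version of that reconciliation walk, assuming each row carries a newest-first chain of (commit timestamp, value) updates; the names and structures here are illustrative, not WiredTiger's actual types:

```python
# Each row maps to an update chain of (commit_ts, value) pairs, newest first,
# like the in-memory update lists described above. (Illustrative only.)

def reconcile(page_rows, checkpoint_ts):
    """Walk every row's chain and keep the newest version committed at or
    before the checkpoint timestamp: the version the checkpoint will write."""
    out = {}
    for key, chain in page_rows.items():
        for ts, value in chain:          # newest first
            if ts <= checkpoint_ts:
                out[key] = value
                break                    # everything older is invisible here
    return out

rows = {
    "doc1": [(30, "v3"), (20, "v2"), (10, "v1")],
    "doc2": [(25, "b2"), (5, "b1")],
}
print(reconcile(rows, checkpoint_ts=22))  # {'doc1': 'v2', 'doc2': 'b1'}
```

The cost is visible in the nested loop: every row, times every chain entry until a visible version is found. Long chains make that inner loop expensive.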
MVCC: Update Chains vs WAL Pages
Both MongoDB and SQLite support multi-version concurrency control, but they implement it differently.
SQLite's WAL-mode MVCC works at the page level. When a page is modified, the new version of the entire page is written to the WAL file. Readers that started before the write see the original page (in the main database file). Readers that started after see the WAL version. Checkpointing copies WAL pages back to the main file. The page is the unit of versioning.
WiredTiger's MVCC works at the row level. Each row in the in-memory skip list carries an update chain: a linked list of modifications ordered by transaction timestamp. A read at a given timestamp walks the update chain to find the first version visible at that timestamp. A write appends to the front of the chain. The row is the unit of versioning.
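A minimal sketch of that chain discipline (illustrative structures, not WiredTiger's actual types): writes prepend in O(1), and a reader at timestamp T walks the chain until it finds the first version committed at or before T:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Update:
    commit_ts: int
    value: str
    next: Optional["Update"] = None   # older version, or None at chain end

def write(chain, ts, value):
    return Update(ts, value, chain)   # prepend newest-first: O(1)

def read_at(chain, ts):
    node = chain
    while node is not None:
        if node.commit_ts <= ts:
            return node.value         # first version visible at this timestamp
        node = node.next
    return None                       # no committed version existed yet

chain = None
chain = write(chain, 10, "v1")
chain = write(chain, 20, "v2")
print(read_at(chain, 15))  # v1
print(read_at(chain, 25))  # v2
print(read_at(chain, 5))   # None
```

A reader pinned at timestamp 15 keeps the `v1` node reachable, which is exactly why long-running reads prevent old versions from being discarded.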
The row-level MVCC has implications:
For write-heavy workloads with many concurrent transactions updating different rows on the same page, WiredTiger's row-level MVCC is more efficient than page-level MVCC. SQLite allows only one writer per database at a time, and any modification to a page writes the entire page to the WAL. WiredTiger permits concurrent transactions to update different rows on the same page.
For long-running reads, WiredTiger's update chains grow unbounded in memory until the old versions are no longer needed. A slow read transaction holds a timestamp that prevents older updates from being discarded, causing memory pressure. SQLite's page-level MVCC has the same problem in a different form: old WAL pages cannot be checkpointed until all readers that started before the corresponding write have finished.
How the Page Cache Differs
SQLite's page cache is simple: a fixed-size pool of 4KB slots. Each slot holds one page. When the cache is full, a least-recently-used page is evicted by writing it to disk (if dirty) and reclaiming the slot.
WiredTiger's cache is more complex. The cache holds in-memory pages in their uncompressed, unpacked form. Because the in-memory representation is larger than the on-disk representation (due to unpacking and decompression), cache occupancy is measured in uncompressed bytes, not page count. A collection with 32KB leaf pages and 2:1 compression uses roughly twice as much cache space per page as disk space.
WiredTiger monitors cache pressure through two thresholds: the eviction target (default 80% of the configured cache size) and the eviction trigger (default 95%). When cache use exceeds the target, background eviction threads begin reconciling and evicting pages to bring it back down. When cache use exceeds the trigger, foreground operations are drafted into eviction work as well, which directly adds latency to reads and writes.
The configured cache size matters significantly. The default is the larger of 256MB and 50% of (RAM minus 1GB). For workloads with large documents, this can be consumed quickly by a relatively small number of pages.
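The sizing rule reduces to simple arithmetic. A sketch assuming the default formula described above (the larger of 256MB and 50% of RAM minus 1GB) plus a 2:1 compression ratio:

```python
GiB = 1024**3
MiB = 1024**2

def default_cache_bytes(ram_bytes: int) -> int:
    # Default WiredTiger cache: 50% of (RAM - 1GB), with a 256MB floor.
    return max(256 * MiB, (ram_bytes - 1 * GiB) // 2)

ram = 16 * GiB
cache = default_cache_bytes(ram)
print(cache // MiB)  # 7680 -> 7.5GB cache on a 16GB host

# With 2:1 compression, on-disk pages roughly double when loaded into cache:
disk_resident = 10 * GiB
in_cache = disk_resident * 2
print(in_cache > cache)  # True -> this working set cannot fit
```

On this hypothetical 16GB host, a 10GB on-disk working set already exceeds the cache once decompressed, so eviction runs continuously.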
Comparison Summary
| Dimension | SQLite | MongoDB (WiredTiger) |
| ---------------------------- | ----------------------------------------- | ----------------------------------------------- |
| Page size | Fixed, 4KB default (512B to 64KB) | Variable; 4KB internal, 32KB leaf (configurable)|
| On-disk format | Same as in-memory (no transformation) | Compressed; different from in-memory format |
| In-memory format | Page cache holds exact disk pages | Decompressed skip list with update chains |
| Overflow threshold | ~4057 bytes (table leaf, 4KB page) | ~8KB (one-quarter of 32KB leaf page) |
| Inline partial storage | First ~4057 bytes inline, rest chained | Nothing inline; full value on overflow page |
| Overflow chain structure | Linked list of 4KB pages | Linked overflow pages (full value, compressed) |
| Compression | None | Snappy by default (zlib, zstd available) |
| MVCC granularity | Page-level (WAL pages) | Row-level (in-memory update chains) |
| Write-back mechanism | Copy modified page to WAL | Reconciliation: re-serialize + compress page |
| Max value size               | ~1GB per string/BLOB by default (SQLITE_MAX_LENGTH) | 16MB per document, enforced by the MongoDB layer |
How It All Fits Together
The core structural difference is the extra transformation layer in WiredTiger. SQLite's page cache is a transparent mirror of the disk. WiredTiger's cache is a different data structure that happens to represent the same logical data as what's on disk, with the reconciliation process bridging the two forms.
Lessons Learned
WiredTiger's larger default page size shifts the overflow threshold. At 32KB leaf pages, documents up to ~8KB stay inline. Most MongoDB documents in typical workloads (user records, order documents, event logs) are well under 8KB in BSON and never trigger overflow. SQLite's 4KB page and ~4057 byte threshold means even moderately sized rows hit overflow much sooner. This is not an accident: MongoDB was designed with richer, larger documents in mind.
Compression changes the I/O math. A 32KB in-memory page might write as 14KB to disk. SQLite's 4KB page always writes as 4KB. For storage-bound workloads, WiredTiger can achieve significantly higher effective throughput despite its larger page size. For CPU-bound workloads (compression and decompression on every page boundary), the calculus reverses.
The in-memory vs on-disk split makes cache sizing critical. WiredTiger's cache holds pages in their uncompressed form. If your compression ratio is 2:1 and you have 100GB of data on disk, the effective working set in memory can be up to 200GB of uncompressed pages. Running WiredTiger with too small a cache means constant eviction, reconciliation overhead on every write path, and cache misses on reads.
Row-level MVCC is powerful but has hidden costs. Long-running transactions that hold an old read timestamp prevent WiredTiger from discarding update chain entries. On a busy write workload, this can cause in-memory update chains to grow very long, making reconciliation expensive and increasing cache pressure. MongoDB's session timeout and transaction timeout settings exist partly to prevent this.
SQLite's simplicity is a genuine advantage for its use case. The fixed-page, no-compression, page-is-the-unit-of-everything model is trivially understandable. You can reason about exactly how many I/Os a query will cost. You can compute overflow page counts with arithmetic. WiredTiger's model is more capable but significantly harder to reason about: cache occupancy, compression ratios, update chain lengths, and reconciliation timing all interact in ways that make performance prediction difficult without measurement.
References
- SQLite Overflow Pages - When Your Rows Don't Fit - SQLite storage internals, previous post in this series
- WiredTiger Storage Engine Documentation
- WiredTiger Architecture Guide - Pages
- MongoDB WiredTiger Storage Engine
- WiredTiger Source - Page Format
- SQLite File Format - Overflow Pages
- MongoDB Production Notes - WiredTiger Cache
Conclusion
Please reach out to me here for more ideas or improvements.