Your database has gigabytes of dirty pages in memory. At some point they need to hit disk. The naive approach is to pause all writes, flush everything cleanly, and resume. It works, but it means your p99 latency spikes every few minutes, your write throughput drops to zero for hundreds of milliseconds, and your on-call team gets paged. Every major storage system has had to solve this. The solutions are more varied than you'd expect.
The Problem
A checkpoint has one job: produce a consistent snapshot of the database on disk so that, after a crash, recovery does not have to replay the entire write-ahead log from the beginning.
The tricky part is "consistent." If you flush page 42 at time T1 and page 43 at time T2, and a transaction modified both between T1 and T2, you now have a disk image that never existed in memory. Recovering from that image gives you a corrupted database.
The brute-force solution is a "sharp checkpoint": freeze all writes, flush everything, unfreeze. You get a provably consistent image, but you also get a multi-hundred-millisecond stall. For an OLTP system doing 50,000 writes per second, that stall shows up as a cliff in your latency histogram every time the checkpoint fires.
The alternatives, used by virtually every production database, are collectively called "fuzzy" or "online" checkpointing. The core insight: you do not need to freeze the world if you have a way to reconstruct what the state was at a specific point in time, even while the state continues to change.
Prerequisites
- Familiarity with write-ahead logging (WAL) at a conceptual level
- Basic understanding of buffer pool management in databases
- Knowing what "LSN" (Log Sequence Number) means helps for the PostgreSQL section
- Awareness of what copy-on-write semantics are at the OS level
The Approaches
Fuzzy Checkpointing with WAL Replay (PostgreSQL)
PostgreSQL's checkpoint does not stop writes. Instead it does this:
1. Record the current WAL position as the "checkpoint start LSN" (the redo point).
2. Begin scanning the buffer pool and writing dirty pages to disk in the background, via the bgwriter and checkpointer processes.
3. While this is happening, normal write traffic continues. Pages that were already flushed can get dirtied again. That is fine.
4. When all pages that were dirty at step 1 have been flushed, write a CHECKPOINT record to the WAL with the redo point from step 1.
5. Update pg_control to record the new checkpoint location.
The result is not a clean snapshot. Some pages on disk reflect state after the redo point. But that is acceptable, because on crash recovery, PostgreSQL replays the WAL forward from the redo point. Any page written after the redo point will be overwritten with the correct version from the WAL. Pages written before the redo point are already durable.
The key invariant is not "all pages are consistent with each other." It is "all pages are at least as old as the redo point, and the WAL from the redo point forward is complete." Recovery corrects everything else.
Timeline:
LSN 1000: dirty pages start flushing <-- redo point
LSN 1020: page 42 flushed (state from LSN 1005)
LSN 1040: page 43 flushed (state from LSN 1038, after redo point -- this is fine)
LSN 1050: CHECKPOINT record written
Crash at LSN 1045:
Recovery replays WAL from LSN 1000 forward.
Page 42 gets replayed to its correct state.
Page 43 is already current.
One subtlety: full_page_writes. The first time a page is modified after a checkpoint starts, PostgreSQL writes the entire page image into the WAL, not just the change. This guards against partial writes: if the OS crashes mid-page-write, the full-page image in the WAL can restore the page before replaying the diff. It costs WAL volume but eliminates a whole class of corruption.
The cost of fuzzy checkpointing in PostgreSQL is I/O spread: the checkpointer deliberately throttles its write rate (controlled by checkpoint_completion_target, default 0.9) to avoid a burst of I/O that would starve foreground queries. You trade a short pause for a longer, gentler I/O ramp.
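The knobs that pace this live in postgresql.conf. These are real parameters, but the values below are illustrative, not recommendations:

```ini
# postgresql.conf -- checkpoint pacing (illustrative values)
checkpoint_timeout = 5min            # maximum time between checkpoints
checkpoint_completion_target = 0.9   # spread flush I/O over 90% of the interval
max_wal_size = 2GB                   # WAL volume that forces an early checkpoint
```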
Shadow Paging with WAL Checkpointing (SQLite WAL Mode)
SQLite's WAL mode flips the architecture. Instead of writing to the main database file and logging changes separately, it writes only to the WAL file during transactions. The main database file is the "checkpoint," and it is always consistent because it is only updated during an explicit checkpoint operation.
Reads check the WAL first. If a page appears in the WAL, that version is used. Otherwise the main file is read. This means readers never block writers and writers never block readers, which is the headline feature of WAL mode.
A checkpoint copies pages from the WAL back to the main database file. The tricky part: you cannot overwrite a WAL page that a current reader might still need. SQLite tracks this with "read marks," a small array of frame numbers indicating the WAL position at which each active reader started. A checkpoint can only copy WAL frames up to the minimum read mark.
// Simplified: SQLite WAL checkpoint logic (PASSIVE mode)
// nBackfill = frames already copied into the main db file;
// mxFrame = last valid frame in the WAL
for (frame = wal->nBackfill; frame < wal->mxFrame; frame++) {
    if (frame >= minReadMark) break; // don't overwrite frames active readers need
    copyFrameToDatabase(wal, frame);
}
wal->nBackfill = frame; // remember how far the backfill got
The checkpoint is non-blocking by default (PASSIVE mode): it copies as many frames as it can without waiting for readers. Frames that active readers are sitting on get left in the WAL. The WAL never truncates until all frames can be checkpointed (or you use TRUNCATE mode and accept that readers might have to block briefly).
In write-heavy workloads, this means the WAL can grow without bound if a long-running reader holds the checkpoint back. This is the main operational footgun of SQLite's WAL mode.
Fork-Based Snapshot (Redis BGSAVE)
Redis keeps its entire dataset in memory. Persisting it to disk (the RDB file) requires serializing potentially gigabytes of data. Redis's answer: fork().
$ redis-cli BGSAVE
Background saving started
When BGSAVE runs, Redis calls fork() to create a child process. The child gets a copy-on-write view of the parent's memory at the exact moment of the fork. The child then walks all the data structures and writes them to a new RDB file sequentially.
The parent continues serving writes. When the parent modifies a memory page, the OS creates a private copy for the parent, leaving the child's view (the original page) intact. The child always sees the consistent snapshot from the fork point, regardless of what the parent does.
Parent process (writes continue):
[page A] -> modified, OS creates copy, parent gets new page
[page B] -> unmodified, parent and child share the same physical page
Child process (reads from fork-point snapshot):
[page A] -> reads original version (before parent's write)
[page B] -> reads shared page (same as parent, no copy needed)
The cost is memory. In the worst case, if every page is written during the fork, memory usage doubles. Redis exposes this as rdb_changes_since_last_save and used_memory_rss, and it is the reason why Redis instances need headroom above their working set size. A 16 GB Redis instance on a 20 GB host will run out of memory during a checkpoint under heavy write load.
The RDB file is written atomically: the child writes to a temp file and renames it over the old RDB on completion. If the child crashes, the old RDB is intact.
Memtable Flush and Compaction Pipeline (RocksDB)
RocksDB does not have a traditional checkpoint in the database sense. Writes go to a MemTable (an in-memory skip list), and when the MemTable reaches a size threshold, it is converted to an immutable MemTable and a new active MemTable is allocated. A background thread then flushes the immutable MemTable to an SSTable file on disk (Level 0).
Write path:
WAL append (synchronous, configurable) --> MemTable insert
|
[MemTable full]
|
Rotate to immutable MemTable
Allocate new active MemTable
|
[Background flush thread]
|
Write L0 SSTable to disk
The flush itself never blocks writes because the active MemTable is separate from the immutable one being flushed. Writes accumulate in the new active MemTable while the flush proceeds. The WAL guarantees durability: even if the flush has not finished, a crash can be recovered by replaying the WAL.
RocksDB also supports GetLiveFiles() for point-in-time snapshots. This is used by tools like rocksdb_checkpoint and by TiKV for consistent backups. It works by flushing the MemTable to L0, then hardlinking all current SSTable files into a new directory. Hardlinks are instantaneous and the files are immutable once written, so this is a consistent snapshot with no write stall.
// Simplified sketch of RocksDB's checkpoint:
// flush the memtable, then hardlink all SSTables
Status Checkpoint::CreateCheckpoint(const std::string& checkpoint_dir) {
  // 1. Flush the memtable to an L0 SSTable
  db_->Flush(FlushOptions());
  // 2. Get the list of all live SSTable files
  std::vector<std::string> live_files;
  uint64_t manifest_file_size = 0;
  db_->GetLiveFiles(live_files, &manifest_file_size);
  // 3. Hardlink each SSTable into the checkpoint directory
  for (const auto& file : live_files) {
    env_->LinkFile(db_dir + file, checkpoint_dir + file);
  }
  // Hardlink creation is atomic at the filesystem level -- no partial state
  return Status::OK();
}
The compaction process (merging L0 through LN SSTables) runs entirely in the background and never blocks reads or writes. Reads consult all levels concurrently using a consistent view; the old SSTable files are not deleted until all active iterators pointing to them have been released.
WiredTiger's Hazard Pointers and Checkpoint Cursor (MongoDB)
WiredTiger, available in MongoDB since 3.0 and the default storage engine since 3.2, uses a B-tree structure with a checkpoint mechanism that is closer to PostgreSQL's fuzzy checkpoint but implemented with its own concurrency primitives.
WiredTiger maintains two "checkpoints" at all times: the last durable checkpoint (on disk) and the in-progress one being built. When a checkpoint starts, it records the current "stable timestamp" (in MongoDB, this is coordinated with the replication system so only majority-committed writes are checkpointed). It then walks all modified B-tree pages and writes them to disk.
Concurrent readers use "hazard pointers": before reading a page, a thread registers the page's address. The checkpoint process checks hazard pointers before evicting or overwriting a page, ensuring it does not free memory that a reader is actively using. This is a form of lock-free synchronization that avoids any global pause.
The checkpoint writes to a new location on disk rather than overwriting the old pages (WiredTiger uses append-only writes). When the checkpoint completes, it updates a small metadata file atomically. The old pages become garbage and are reclaimed on the next pass. If the process crashes mid-checkpoint, the metadata file still points to the previous valid checkpoint, and recovery replays the journal (WiredTiger's WAL) from that point.
Disk layout during checkpoint:
[checkpoint N: pages A, B, C at offset 0x1000, 0x2000, 0x3000]
[in-progress writes: pages A', B' at offset 0x8000, 0x9000]
Crash mid-checkpoint:
metadata.json still points to checkpoint N
Recovery replays journal from checkpoint N timestamp
Pages A', B' at 0x8000 are ignored (never committed)
MongoDB exposes the checkpoint interval via storage.syncPeriodSecs (default: 60 seconds). The checkpoint does not stall writes, but it does consume I/O bandwidth. On heavily loaded systems, this can cause latency spikes if the disk is saturated; the fix is usually faster storage or more aggressive wiredTigerCacheSizeGB tuning to reduce the dirty page ratio.
How It All Fits Together
Every non-blocking checkpoint strategy reduces to one of three primitives, or a combination:
1. Record where you are, flush async, replay the log forward from that point
(PostgreSQL fuzzy checkpoint, WiredTiger)
2. Write to a side channel, checkpoint = merge side channel back to main store
(SQLite WAL, RocksDB L0 flush)
3. Fork the process to get a copy-on-write snapshot, serialize from the child
(Redis BGSAVE)
The trade-offs follow directly from the primitive:
| System | Primitive | Write stall | Memory overhead | Recovery cost |
|--------|-----------|-------------|-----------------|---------------|
| PostgreSQL | WAL + fuzzy flush | None (I/O spread) | Low | Replay from redo point |
| SQLite WAL | Side-channel merge | Brief (TRUNCATE mode) | Low (WAL file) | WAL replay |
| Redis BGSAVE | fork() | None | Up to 2x RSS | None (RDB is full snapshot) |
| RocksDB | Immutable flush | None | MemTable per flush | WAL replay to L0 |
| WiredTiger | Append-only + hazard ptrs | None (I/O bound) | Low | Journal replay |
Lessons Learned
The "no stall" claim in most systems documentation is technically true but practically incomplete. PostgreSQL does not pause writes during a checkpoint, but it does throttle them via checkpoint_completion_target to spread I/O. Redis does not stall the parent, but the child's memory pressure can trigger OOM or swap thrashing. RocksDB flushes do not stall unless you hit the write buffer limit and the flush thread falls behind.
The practical lesson: checkpoint behavior is only observable under load. A system that checkpoints cleanly at 10% write saturation may stall badly at 80% because the background flush cannot keep up with the incoming write rate. Tuning checkpoint aggressiveness (frequency, write rate, buffer size) is always workload-specific.
The other non-obvious cost is recovery time. A fuzzy checkpoint is cheap to produce but more expensive to recover from, because recovery must replay the WAL forward. A full snapshot (Redis RDB, RocksDB checkpoint via GetLiveFiles()) has a higher upfront cost but zero WAL replay on restart. For systems with multi-hour WAL streams, the recovery time difference matters a lot.
What's Next
The next layer of this problem is distributed checkpointing: how do you produce a consistent snapshot across multiple nodes without a global pause? Chandy-Lamport gives you the theoretical model, but systems like Flink (asynchronous barrier snapshotting) and Spanner (TrueTime-based snapshot reads) have had to bend those ideas considerably to make them work at production scale. That is a different post.