Your orders service needs the user's email to send a confirmation. The payments service needs the user's tier to apply a discount. The analytics pipeline needs the user's signup date for cohort analysis. The search service needs the user's display name for autocomplete.
So someone sets up a CDC pipeline from the users table, and now four services each have their own copy of every user. A column gets renamed in the source. The CDC connector picks up the new schema. Three consumers handle it. One doesn't. A customer sees stale pricing for six hours, and the on-call engineer spends a full shift figuring out which copy of the data diverged.
CDC didn't cause this. But it made it effortless, and effortless replication without a clear philosophy behind it is how you end up with 12 copies of the same table, zero confidence in any of them, and a reconciliation job that runs every night "just in case."
The Problem
Change Data Capture is a genuinely useful pattern. It captures row-level changes from a database's write-ahead log (or oplog, or binlog) and streams them to downstream consumers. Tools like Debezium, Kafka Connect, and managed offerings from every major cloud provider have made it trivially easy to set up.
The problem is not CDC itself. The problem is that CDC reduces the cost of replication to near zero without reducing the cost of maintaining replicas. Setting up a CDC connector takes an afternoon. Debugging a consistency issue across six consumers of that connector takes a week.
Here is what typically happens:
- Service A needs one field from Service B's database
- Someone suggests an API call, but there are latency concerns, or Service B's API doesn't expose that field, or "what if Service B is down?"
- Someone else says: "just set up CDC, replicate the table, query it locally"
- It works. It's fast. Nobody thinks about it for six months
- The source schema changes. Or the CDC pipeline lags. Or a consumer's local copy drifts because of a bug in its ingestion logic
- Now you have a distributed data consistency problem that didn't need to exist
The pattern repeats across organizations because the incentives are misaligned: the team setting up the CDC pipeline bears almost no cost. The cost lands on whoever debugs the divergence months later.
Prerequisites
- Familiarity with CDC concepts (WAL-based change capture, Debezium, Kafka Connect, or similar)
- Experience with microservice architectures where services own their own databases
- General understanding of eventual consistency and its operational implications
Technical Decisions
Why Teams Reach for CDC
Before defining when not to replicate, it helps to understand why replication feels like the obvious answer.
Latency. A local query is faster than a network call. If the payments service needs the user's tier on every transaction, a local lookup in its own database avoids a synchronous dependency on the users service.
Availability. If the users service goes down, every service that depends on its API also degrades. A local copy means the payments service can keep processing even during an outage upstream.
Query flexibility. The users service exposes a REST API with specific endpoints. The analytics team needs to join users with events in SQL. CDC lets them replicate the users table into their warehouse and query it however they want.
These are all real concerns. The mistake is treating them as universal justifications. Not every consumer needs sub-millisecond latency. Not every service needs to survive an upstream outage. Not every query pattern requires a local copy.
The Three Modes of Data Consumption
Every time a service needs data from another service, the interaction falls into one of three categories:
1. Replicate: the consumer needs the data at rest, in a different shape
This is the legitimate core use case for CDC. The consumer is not just mirroring the source; it is transforming the data to serve a fundamentally different access pattern.
Examples:
- Replicating a normalized relational table into Elasticsearch for full-text search
- Feeding transactional data into a columnar warehouse for analytical queries
- Building a materialized view that pre-joins three tables for a read-heavy dashboard
The key signal: the consumer's schema looks nothing like the source. It has different indexes, different denormalization, maybe even a different data model entirely. You cannot serve this use case with an API call because the consumer needs to query the data in ways the source was never designed for.
2. Query the source: the consumer needs fresh data in the same shape
If the consumer essentially mirrors the source table (same columns, same structure, just in a different database), ask why it needs a copy at all.
Examples:
- The orders service needs the user's email to send a confirmation (one field, one lookup per order)
- A dashboard needs the current count of active users (a single aggregation query)
- A service needs to validate that a product ID exists before creating an order
The key signal: the consumer's query would work fine against the source database. The data is small, the access pattern is simple, and freshness matters. A synchronous API call or a thin caching layer is simpler, cheaper, and always consistent.
The "but what if the source is down?" objection is real but often overstated. If the users service is down, should the payments service really continue processing with stale user data? Sometimes yes, but often the correct behavior is to degrade gracefully, not to silently use a copy that might be hours behind.
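The "thin caching layer" mentioned above can be as small as a read-through cache with a short TTL in front of the source's API. This is a sketch under stated assumptions: the fetch function, TTL, and user-lookup example are all hypothetical stand-ins for your real HTTP client and endpoints.

```python
# A read-through cache in front of the source service's API: fresh-enough
# reads without maintaining a replica, and the source stays the only
# system of record.
import time

class CachedLookup:
    def __init__(self, fetch, ttl_seconds=30):
        self._fetch = fetch        # e.g. lambda uid: http_get(f"/users/{uid}")
        self._ttl = ttl_seconds
        self._cache = {}           # key -> (value, fetched_at)

    def get(self, key):
        hit = self._cache.get(key)
        if hit and time.monotonic() - hit[1] < self._ttl:
            return hit[0]          # served locally, bounded staleness
        value = self._fetch(key)   # otherwise go to the source of truth
        self._cache[key] = (value, time.monotonic())
        return value

calls = []
def fetch_email(user_id):
    calls.append(user_id)          # stand-in for the real API call
    return f"user{user_id}@example.com"

lookup = CachedLookup(fetch_email, ttl_seconds=30)
lookup.get(7)
lookup.get(7)                      # second read served from cache
```

Staleness here is bounded and explicit (at most `ttl_seconds`), whereas a lagging CDC replica's staleness is unbounded and invisible.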
3. React and discard: the consumer needs to respond to a change, not store it
Many CDC consumers don't actually need the data at rest. They need to do something when the data changes.
Examples:
- Send a welcome email when a new user is created
- Invalidate a cache entry when a product price changes
- Trigger a fraud check when a transaction is created
- Update a counter or metric in a monitoring system
The key signal: the consumer processes the event and is done. It doesn't need to query the data later. It doesn't build a local copy. The event is a trigger, not a data transfer.
This is often the most over-engineered pattern. Teams set up full CDC replication when all they needed was an event bus. The consumer ends up with a complete replica of the users table just to detect new signups.
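The react-and-discard shape is small enough to show in full. A hypothetical sketch, assuming a JSON-ish event with a `type` field; `send_welcome_email` stands in for a real email client.

```python
# React-and-discard: handle the change, trigger the side effect, keep nothing.
sent = []

def send_welcome_email(address):
    sent.append(address)  # stand-in for a real email client call

def handle_user_event(event: dict):
    # Only new signups matter to this consumer; everything else is ignored.
    if event["type"] == "user.created":
        send_welcome_email(event["email"])
    # No table written, no local copy kept: the event is consumed and dropped.

handle_user_event({"type": "user.created", "email": "new@example.com"})
handle_user_event({"type": "user.updated", "email": "old@example.com"})
```

Everything this consumer needs fits in the event payload, which is the signal that an event bus, not a replica, was the right tool.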
Implementation
The Decision Framework
Before setting up a CDC pipeline, run through these five questions:
Question 1: Does the consumer need the data at rest, or does it just need to react to changes?
If the answer is "react," you don't need CDC replication. You need an event. Publish a domain event ("user.created", "order.completed") from the source service, let the consumer subscribe, process, and move on. No local copy, no sync lag, no schema coupling.
Question 2: Can the consumer tolerate staleness?
CDC is eventually consistent. Depending on your pipeline, the lag can range from milliseconds to minutes. If the consumer cannot tolerate any staleness (e.g., checking a user's balance before authorizing a payment), a local replica is the wrong answer. You need a synchronous read from the source of truth.
Question 3: Is the consumer reshaping the data or mirroring it?
This is the most important question. If the consumer's table is structurally identical to the source, you have a mirror, not a materialized view. Mirrors are almost always a sign of a missing API or an over-cautious availability concern.
Reshaping is legitimate. Mirroring is a code smell.
Question 4: Who owns the schema?
When the source team renames a column, adds a field, or changes a type, what happens downstream? If 12 CDC consumers ingest that table, each one needs to handle the schema change. You've built a distributed monolith: tightly coupled systems connected by a log instead of an API.
At least with an API, the source team can version it, deprecate fields gracefully, and maintain a contract. With raw CDC, the contract is the database schema itself, and database schemas were never designed to be public interfaces.
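If you must consume a raw table through CDC, one way to blunt the schema coupling is to project each record onto an explicit allowlist of the fields the consumer actually depends on. A sketch, with illustrative field names:

```python
# Project the change record onto the consumer's declared contract, so new
# source columns pass through harmlessly and removed columns fail loudly.
CONSUMED_FIELDS = {"id", "email", "tier"}  # this consumer's actual contract

def project(after_row: dict) -> dict:
    missing = CONSUMED_FIELDS - after_row.keys()
    if missing:
        # Fail fast with a clear message instead of a 3am deserialization error
        raise ValueError(f"source dropped fields this consumer needs: {missing}")
    # Unknown columns (e.g. a new phone_verified boolean) are silently ignored.
    return {k: after_row[k] for k in CONSUMED_FIELDS}

row = {"id": 7, "email": "a@example.com", "tier": "pro", "phone_verified": True}
projected = project(row)
```

This doesn't make the database schema a good public interface; it just makes the implicit contract explicit and the failure mode legible.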
Question 5: What is the blast radius of divergence?
When (not if) the local copy drifts from the source, what breaks? If the answer is "a customer sees the wrong price" or "a payment is authorized against stale data," the operational risk of maintaining a replica outweighs the convenience.
The Decision Table
| Signal | Replicate | Query Source | React & Discard |
|---|---|---|---|
| Consumer schema differs from source | Yes | | |
| Consumer needs sub-millisecond reads | Yes | | |
| Consumer must survive source outages | Yes | | |
| Consumer mirrors source schema | | Yes | |
| Consumer needs strong consistency | | Yes | |
| Access is infrequent or low-volume | | Yes | |
| Consumer processes event then is done | | | Yes |
| Consumer doesn't query the data later | | | Yes |
| Consumer only needs to trigger a side effect | | | Yes |
The Distributed Monolith Antipattern
The most dangerous failure mode of CDC overuse is the distributed monolith. It looks like this:
Users DB (source of truth)
│
├── CDC → Kafka → Orders Service (local users table)
├── CDC → Kafka → Payments Service (local users table)
├── CDC → Kafka → Analytics Warehouse (local users table)
├── CDC → Kafka → Search Service (users in Elasticsearch)
├── CDC → Kafka → Notifications Service (local users table)
└── CDC → Kafka → Fraud Service (local users table)
Six consumers. Five of them have a structurally identical copy of the users table. When the source team adds a phone_verified boolean:
- The Kafka connector picks up the new column
- Analytics handles it fine (their ingestion is schema-flexible)
- Search re-indexes (Elasticsearch is schema-flexible)
- Orders, Payments, and Notifications have rigid table schemas. Their CDC consumers fail to deserialize the new column. Events back up in Kafka. A lag alert fires at 3am.
This is tight coupling with extra steps. The teams thought they were decoupled because there's no synchronous API call. But they're coupled to the schema, coupled to the CDC pipeline's uptime, and coupled to Kafka's consumer group coordination. The coupling just moved from the request path to the data path, which is harder to see and harder to debug.
The fix is not "make the CDC pipeline more resilient." The fix is to ask: do Orders, Payments, and Notifications actually need a full copy of the users table? Usually the answer is no. Orders needs the user's email. Payments needs the user's tier. Notifications needs the user's preferences. These are API calls, not replication use cases.
CDC Used Well
For contrast, here are patterns where CDC genuinely earns its complexity:
Search indexing. Elasticsearch needs a denormalized, full-text-indexed copy of your data in a fundamentally different structure. You cannot serve this with API calls. CDC into Elasticsearch (or OpenSearch, or Typesense) is one of the cleanest uses of the pattern.
Analytics and data warehousing. Your warehouse needs historical, append-only data in a columnar format for analytical queries that your OLTP database was never designed to serve. CDC into BigQuery, Snowflake, or Redshift is the standard pattern here, and it works because the consumer is reshaping, not mirroring.
Materialized views across service boundaries. A dashboard needs to show data that joins across three services' databases. Rather than making three API calls on every page load, you CDC the relevant tables into a read-optimized store and materialize the join. The consumer's schema is a purpose-built denormalization, not a copy.
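The materialized-view case can be sketched as two CDC streams feeding one purpose-built row. Everything here is illustrative (table names, keys, the in-memory dicts standing in for a read-optimized store); the point is that the view's shape exists in neither source.

```python
# Maintain a pre-joined dashboard row from two CDC streams. Events may
# arrive in either order, so the join is repaired when the missing side lands.
users, view = {}, {}  # latest user rows by id; materialized rows by order id

def on_user_change(row):
    users[row["id"]] = row
    # Back-fill the join for orders that arrived before this user row.
    for materialized in view.values():
        if materialized["user_id"] == row["id"]:
            materialized["user_tier"] = row["tier"]

def on_order_change(row):
    user = users.get(row["user_id"], {})
    view[row["id"]] = {
        "order_id": row["id"],
        "user_id": row["user_id"],
        "total": row["total"],
        "user_tier": user.get("tier"),  # may lag until the user event arrives
    }

on_order_change({"id": 100, "user_id": 7, "total": 42})  # user not seen yet
on_user_change({"id": 7, "tier": "pro"})                 # join repaired here
```

Note that the dashboard reads one row instead of making three API calls, and the eventual-consistency window is confined to this one consumer.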
Event sourcing integration. CDC from an existing database into an event store lets you incrementally adopt event-driven patterns without rewriting the source. The events are derived from real state changes, not synthetic.
In each case, the consumer is transforming the data to serve a purpose the source cannot.
How It All Fits Together
The philosophy boils down to one principle: replicate shape, not data.
If the consumer needs the data in a different shape (different indexes, different joins, different query patterns), replication is justified because no amount of API design can bridge the gap between an OLTP row store and a full-text search engine.
If the consumer needs the same data in the same shape, you don't have a replication problem. You have a service boundary problem. Fix the boundary: expose an API, add a caching layer, or reconsider whether the data should live in that service at all.
If the consumer doesn't need the data at rest, you don't have a replication problem either. You have an eventing problem. Publish domain events, not database changelogs.
┌─────────────────────┐
│ Does the consumer │
│ need data at rest? │
└─────────┬───────────┘
┌────┴────┐
Yes No
│ │
┌────────┴──┐ React &
│ Different │ Discard
│ shape? │ (events)
└────┬──────┘
┌────┴────┐
Yes No
│ │
Replicate Query
(CDC) Source
(API)
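The decision tree above compresses to a few lines of code. The predicate names mirror the framework's questions; this is a summary of the argument, not a library you'd ship.

```python
def consumption_mode(needs_data_at_rest: bool, different_shape: bool) -> str:
    """Mirror of the decision tree: replicate only when the consumer holds
    the data at rest in a fundamentally different shape."""
    if not needs_data_at_rest:
        return "react-and-discard (events)"
    if different_shape:
        return "replicate (CDC)"
    return "query the source (API)"
```

Only one of the three branches leads to CDC, which is the whole point: replication is the exception that must justify itself, not the default.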
Lessons Learned
The cost of replication is not in the setup. Setting up a Debezium connector takes an afternoon. The real cost lives in maintaining schema compatibility across consumers, debugging lag-induced inconsistencies, and running reconciliation jobs to detect drift. Teams consistently underestimate this because the feedback loop is months long.
"What if the source is down?" is not always a replication argument. Sometimes the correct behavior during an upstream outage is to degrade, not to silently serve stale data. A payment authorized against a cached user tier that changed two hours ago is worse than a payment that fails gracefully and retries.
Mirrors masquerade as materialized views. The most common CDC antipattern is a consumer that replicates a table into an identical schema in its own database. If you can describe the consumer's data model by saying "it's the same as the source, but local," you almost certainly don't need CDC.
Schema coupling is still coupling. Moving from API coupling to schema coupling via CDC doesn't decouple your services. It makes the coupling implicit, which is worse. At least APIs have versioning, contracts, and deprecation policies. A database schema has none of those when it's being consumed through a WAL stream.
CDC pipelines need SLOs. If you do replicate, treat the pipeline as a production system. Define acceptable lag (e.g., p99 under 30 seconds), monitor consumer offsets, alert on schema changes, and have a runbook for when the pipeline breaks. Most teams set up CDC and forget it, then discover during an incident that the pipeline has been broken for days.
What's Next
This post focused on the when and why of data replication. A natural follow-up is the mechanics of keeping replicas consistent once you've decided replication is justified:
- Schema evolution strategies: How do you handle source schema changes without breaking consumers? Avro with a schema registry, Protobuf, or JSON Schema each have different guarantees.
- Lag monitoring and SLOs: How do you define and measure "fresh enough" for each consumer? What's the operational playbook when lag spikes?
- Reconciliation patterns: When you do detect drift between a replica and its source, how do you fix it without a full re-sync?
References
- Debezium: Change Data Capture for Databases
- Turning the Database Inside-Out (Martin Kleppmann, 2015)
- Designing Data-Intensive Applications (Martin Kleppmann, O'Reilly)
- The Log: What Every Software Engineer Should Know About Real-Time Data (Jay Kreps)
- Implement your own CDC using Kafka (gauravsarma.com)