~/posts/2026-01-02_disabling-allow-partial-results-in-elasticsearch.md
$

cat 2026-01-02_disabling-allow-partial-results-in-elasticsearch.md

📅

Just heard of an incident at a company where a dev turned off `allow_partial_search_results` in their Elasticsearch queries. This setting is on by default. When they explicitly turned it off, queries started failing more often and long-running queries took twice as long to complete. I've been bitten by the same issue before.

It sounds nice to be able to say that you don't want partial search results. But in distributed systems, things aren't black and white. With the setting disabled, if Elasticsearch can't fetch all the results from a specific shard, the entire query fails. So if you're trying to fetch 1000 documents and only 900 come back, the whole query fails. That sounds fine logically. But in real workloads, you aren't the only one running queries on the cluster, so there will be times when a particular shard is under load and doesn't have the resources to serve a large query. In their case, shards repeatedly failed to get the resources they needed, and that turned into query failures across multiple shards.
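Under the hood this is just a per-request query parameter on the search API. A minimal sketch of what the request URL looks like with the setting disabled (the host and the index name `logs-2026` are made-up placeholders; `allow_partial_search_results` is the real parameter name):

```python
from urllib.parse import urlencode

# Build a search URL that disables partial results for this one request.
# "allow_partial_search_results" is the actual Elasticsearch query
# parameter; host and index are illustrative only.
params = {"allow_partial_search_results": "false", "size": 1000}
url = f"http://localhost:9200/logs-2026/_search?{urlencode(params)}"
print(url)
```

With this set to `false`, any single failed shard fails the whole request, which is exactly the behavior described above.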

Solution: Instead of relying on the config to guarantee complete results, inspect the ES response itself, which reports shard failures (and, for update queries, the number of documents updated). Even when a result is partial, significant progress has usually been made, so retrying the same query has less remaining work and will typically finish without errors.
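The "verify" half of that can be sketched by checking the `_shards` section that Elasticsearch includes in every search response. The response dict below is a hand-written example, not real server output:

```python
# Example response shape: every Elasticsearch search response carries a
# "_shards" section with total/successful/failed counts. This dict is a
# fabricated sample for illustration.
response = {
    "timed_out": False,
    "_shards": {"total": 5, "successful": 4, "skipped": 0, "failed": 1},
    "hits": {"total": {"value": 900, "relation": "eq"}, "hits": []},
}

def is_partial(resp):
    """True if any shard failed to contribute to this result."""
    shards = resp["_shards"]
    return shards["failed"] > 0 or shards["successful"] < shards["total"]

if is_partial(response):
    # Retry (ideally with backoff): a later attempt against less-loaded
    # shards can complete what this one missed.
    print("partial result: retry needed")
else:
    print("complete result")
```

The point is that the caller decides what to do with a partial result, instead of having the cluster turn every overloaded shard into a hard failure.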

This is a common pattern in distributed systems, often called "write and verify".

The config should only be turned off when you cannot tolerate partial results at all and you don't want to handle the verification yourself.