Automatic indexing of a changed/new/deleted entity happens after the corresponding DB transaction has been committed (this is the default behaviour, I believe). This means that if an indexing operation fails for some reason (ES cluster gone, network error, application redeployment, etc.), the search backend (Lucene or ES) will be out of sync with the database.
What is the best practice for handling such a situation? I can only think of running a background process that tries to reindex the affected entities (a kind of Transactional Outbox pattern). This assumes that the process can identify those entities efficiently (e.g. through some kind of event sourcing, or by marking them with an indexedAt flag, which means touching every single successfully indexed entity, etc.). That is quite a lot of work ;).
Any other recommendations?
The solutions you’re mentioning are basically planned for Hibernate Search 6.1, where we’ll introduce a way to index asynchronously, probably storing the queue of entities to index in Kafka, and then handling errors will only be a matter of redirecting events that couldn’t be processed to the correct queue.
In the meantime, in Hibernate Search 6, your only solution if you have an unreliable network/cluster is to collect the failures through a custom automatic indexing synchronization strategy. I’d recommend starting from whatever implementation you’re currently using, and customizing the “indexingFutureHandler”. The future passed to that handler will ultimately yield an indexing report with a list of all the entities that failed indexing.
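As a rough illustration of the failure-collecting handler idea, here is a JDK-only sketch. `IndexingReport`, `pendingRetry` and `unknownFailures` are hypothetical stand-ins invented for this example, not Hibernate Search types; in a real strategy the consumer would be registered as the “indexingFutureHandler” and would receive the library's own report type.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

class FailureCollectorSketch {
    /** Hypothetical stand-in for the indexing report; NOT the Hibernate Search type. */
    static final class IndexingReport {
        final List<Object> failingEntities;

        IndexingReport(List<Object> failingEntities) {
            this.failingEntities = failingEntities;
        }
    }

    /** Entities whose indexing failed, waiting to be retried by some other process. */
    final Queue<Object> pendingRetry = new ConcurrentLinkedQueue<>();

    /** Operations that failed without producing a report (affected entities unknown). */
    final AtomicInteger unknownFailures = new AtomicInteger();

    /** Non-blocking handler: it only records failures once the future completes. */
    final Consumer<CompletableFuture<IndexingReport>> handler = future ->
        future.whenComplete((report, error) -> {
            if (error != null) {
                unknownFailures.incrementAndGet();
            } else {
                pendingRetry.addAll(report.failingEntities);
            }
        });

    public static void main(String[] args) {
        FailureCollectorSketch collector = new FailureCollectorSketch();
        CompletableFuture<IndexingReport> future = new CompletableFuture<>();
        collector.handler.accept(future);
        // Simulate an indexing run in which two entities could not be indexed.
        future.complete(new IndexingReport(Arrays.<Object>asList("book#1", "book#7")));
        System.out.println("to retry: " + collector.pendingRetry);
    }
}
```

Note that the collected failures live in memory only, so they are lost if the application stops before they are retried.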
Alternatively, if you’re using the async automatic indexing synchronization strategy, these failures are already forwarded to the “background failure handler”, which can be configured.
I think that catching indexing errors will not help, because the catching logic itself may stop/fail (e.g. application restart/redeployment), and then those errors get lost -> cluster out of sync. Anything done after the DB transaction commit may fail and would lead to an out-of-sync situation.
The same applies to storing a queue of entities to index in Kafka - how are you going to sync that queue with the database? The transaction is done, push to Kafka fails -> cluster out-of-sync.
To make this safe, I see only one way: the information about which entities need to be indexed must be stored together with those entities, in the same DB and inside the same transaction that updates/inserts/deletes them (a separate table, aka event sourcing). This guarantees that the knowledge of which entities to index will not be lost. Storing that knowledge in any other system (Kafka or a file system) would require a 2-phase commit, which is quite a pain and usually not an option. When automatic indexing succeeds, it has to mark those entities/events as processed. If automatic indexing fails, those entities/events must be retried by an async process. Or am I missing something?
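To make the idea concrete, here is a minimal in-memory sketch of that flow in Java. It is only an illustration under loud assumptions: the maps and the list stand in for real DB tables and for the search backend, `synchronized` stands in for a DB transaction, and every name (`OutboxSketch`, `commit`, `processOutbox`, ...) is made up for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** In-memory stand-in for a database plus a search index; all names are hypothetical. */
class OutboxSketch {
    final Map<Long, String> entityTable = new HashMap<>();  // id -> entity state
    final List<Long> outboxTable = new ArrayList<>();       // ids awaiting (re)indexing
    final Map<Long, String> searchIndex = new HashMap<>();  // stand-in for Lucene/ES
    boolean indexAvailable = true;                           // flip to simulate an outage

    /** One "transaction": the entity change and its outbox row are written together. */
    synchronized void commit(long id, String stateOrNull) {
        if (stateOrNull == null) {
            entityTable.remove(id);          // delete
        } else {
            entityTable.put(id, stateOrNull); // insert or update
        }
        outboxTable.add(id);
    }

    /** Indexes the current DB state of one entity; throws if the index is unreachable. */
    private void indexFromCurrentState(long id) {
        if (!indexAvailable) {
            throw new IllegalStateException("search index unreachable");
        }
        String state = entityTable.get(id);
        if (state == null) {
            searchIndex.remove(id);          // entity no longer exists in the DB
        } else {
            searchIndex.put(id, state);
        }
    }

    /** Background pass: a row leaves the outbox only after indexing succeeded. */
    synchronized int processOutbox() {
        Iterator<Long> rows = outboxTable.iterator();
        while (rows.hasNext()) {
            try {
                indexFromCurrentState(rows.next());
                rows.remove();
            } catch (IllegalStateException e) {
                break; // index is down: keep the remaining rows for the next pass
            }
        }
        return outboxTable.size(); // number of rows still pending
    }

    public static void main(String[] args) {
        OutboxSketch db = new OutboxSketch();
        db.indexAvailable = false;
        db.commit(1L, "first version");
        System.out.println("pending after failed pass: " + db.processOutbox());
        db.indexAvailable = true;
        System.out.println("pending after retry: " + db.processOutbox());
    }
}
```

Because the processor always indexes the current DB state rather than a payload stored in the event, processing the same row twice is harmless.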
If you want that level of safety, then yes, you’re out of luck with the current versions of Hibernate Search. It’s simply not what it was designed for. We generally assume that this level of failure requires full reindexing. From experience on non-distributed applications, it wasn’t a problem; it will probably be on large-scale distributed applications.
Regarding sync, be aware that you’ll never get complete, immediate consistency between the database and the index. The best we can hope for is eventual consistency: after X seconds, database changes are mirrored in the index. Some people use automatic indexing for most of their data, but exclude some associations because it would slow down their application too much. They launch a full reindexing periodically, and accept that the index may be out of sync in the meantime. What level of consistency you can accept is, obviously, specific to your own requirements.
Regarding Kafka, there are multiple ways to ensure no events are lost:
- The transactional outbox pattern you mentioned.
- Some kind of two-phase commit, though I doubt it’s available/efficient with Kafka.
- Some custom, simpler two-phase commit: send the events to a Kafka queue with the explicit instruction that they should not be processed before X seconds, where X is a number of seconds beyond which you’re reasonably certain the commit will have been performed. Rollbacks don’t matter in this case: we’d just reindex unnecessarily.
- Change Data Capture, e.g. with Debezium: we’d source the events from the database’s transaction log instead of sourcing them from Hibernate ORM like we currently do.
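To illustrate the "do not process before X seconds" option from the list above, here is a small sketch. It is not Kafka code: a JDK `DelayQueue` stands in for the queue, the injected clock exists only to make the example deterministic, and all names (`DelayedReindexSketch`, `publish`, `pollReady`, `graceMillis`) are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

class DelayedReindexSketch {
    /** A reindex request that must not be processed before readyAtMillis. */
    static final class ReindexEvent implements Delayed {
        final long entityId;
        final long readyAtMillis;
        private final LongSupplier clock;

        ReindexEvent(long entityId, long readyAtMillis, LongSupplier clock) {
            this.entityId = entityId;
            this.readyAtMillis = readyAtMillis;
            this.clock = clock;
        }

        @Override
        public long getDelay(TimeUnit unit) {
            return unit.convert(readyAtMillis - clock.getAsLong(), TimeUnit.MILLISECONDS);
        }

        @Override
        public int compareTo(Delayed other) {
            return Long.compare(getDelay(TimeUnit.MILLISECONDS),
                    other.getDelay(TimeUnit.MILLISECONDS));
        }
    }

    private final DelayQueue<ReindexEvent> queue = new DelayQueue<>();
    private final LongSupplier clock;
    private final long graceMillis; // the "X seconds" after which the commit is assumed done

    DelayedReindexSketch(LongSupplier clock, long graceMillis) {
        this.clock = clock;
        this.graceMillis = graceMillis;
    }

    /** Called before the DB commit is attempted, so the event cannot be lost afterwards. */
    void publish(long entityId) {
        queue.add(new ReindexEvent(entityId, clock.getAsLong() + graceMillis, clock));
    }

    /** Returns the ids whose grace period has elapsed; the caller reindexes them from the DB. */
    List<Long> pollReady() {
        List<Long> ready = new ArrayList<>();
        ReindexEvent event;
        while ((event = queue.poll()) != null) { // poll() yields only expired events
            ready.add(event.entityId);
        }
        return ready;
    }

    public static void main(String[] args) {
        DelayedReindexSketch sketch = new DelayedReindexSketch(System::currentTimeMillis, 5_000);
        sketch.publish(42L);
        System.out.println("ready right away: " + sketch.pollReady());
    }
}
```

Since the consumer reindexes from the current database state, an event published for a transaction that was later rolled back only causes an unnecessary reindex, as noted above.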
Which solution we’ll pick exactly is not determined yet. If you’re interested in shaping out a prototype, we can definitely talk about it here or on our chat.
Yes, eventual consistency is unavoidable in such a case. What matters is that the index catches up with the database at some point and that no updates get lost.
Outbox seems to me to be the least intrusive solution: just an additional table for storing the events, plus some synchronisation logic on top to make it work in a clustered environment (e.g. when the application runs in several pods on Kubernetes). Easy to set up, no extra infrastructure needed. I have implemented an outbox with Quartz Scheduler before: it can synchronise the jobs in a clustered environment and only needs a few additional tables in the database, which it creates and manages automatically. So Hibernate Search could implement similar logic (without Quartz, of course ;)). Kafka and Debezium would require additional infrastructure and would rather scare people.
I am on the chat already ;). Is there a topic for that already?