Any "delayed" Index synchronization strategy?


I’m working on (or, for the moment, studying) the possibility of delaying index synchronization. Let me explain.
Currently, we have a choice of 4 synchronization strategies:


Can we imagine a “delayed” strategy that consists of stacking index changes in a queue (in the end, I suppose each change results in a REST call to the ES backend)? I imagine a scenario where I’m reindexing all data with the MassIndexer into another index alias (myindex-rewrite) with “zero downtime”. During that time, people keep using the application, which triggers automatic indexing to write their updates to the current index; I would like those updates to be stacked as well, in addition to being applied to the current index. When the MassIndexer finishes, I need to replay the “work” done by the users on the new index. Otherwise, if I turn off automatic indexing before mass indexing, I lose all their “work” and the search results are wrong.
If an entry can’t be replayed on the new index, then log it (for example after a schema change, such as a field having been removed from a document).
The async strategy doesn’t allow holding back automatic reindexing calls for a fixed amount of time.
So even if it’s still not perfect, a “delayed” strategy would be better than disabling automatic indexing.

Am I wrong?



You’re pretty much describing what the outbox-polling coordination strategy does. I suggest you have a look at this.

That coordination strategy sends entity change events to a queue in the database (a table), and has background processors to process those events and perform the indexing. When you start mass indexing, entity change events are still pushed to that queue, but the processors are temporarily stopped, until the mass indexer finishes. That all happens transparently, without you having to do anything.
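To make the idea concrete, here is a minimal, self-contained sketch of that behavior in plain Java. It is only a toy model, not Hibernate Search's actual implementation: the class `OutboxSketch`, its methods, and the in-memory queue standing in for the database table are all hypothetical names chosen for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical toy model of an outbox; NOT Hibernate Search's implementation.
class OutboxSketch {
    // Stand-in for the outbox table: pending entity change events, in arrival order.
    private final Queue<String> outbox = new ArrayDeque<>();
    // Events that a background processor has already turned into indexing work.
    private final List<String> processed = new ArrayList<>();
    // True while mass indexing is running.
    private boolean processorsPaused = false;

    // Called on every entity change; the event is always enqueued,
    // even while mass indexing is in progress.
    void onEntityChange(String entityId) {
        outbox.add(entityId);
        processPending();
    }

    // Mass indexing starts: event processing is suspended.
    void startMassIndexing() {
        processorsPaused = true;
    }

    // Mass indexing is done: processing resumes and the backlog is replayed.
    void finishMassIndexing() {
        processorsPaused = false;
        processPending();
    }

    // Drains the outbox unless processing is currently suspended.
    private void processPending() {
        if (processorsPaused) {
            return;
        }
        while (!outbox.isEmpty()) {
            processed.add(outbox.poll());
        }
    }

    int pendingCount() {
        return outbox.size();
    }

    List<String> processedEvents() {
        return List.copyOf(processed);
    }

    public static void main(String[] args) {
        OutboxSketch sketch = new OutboxSketch();
        sketch.onEntityChange("book#1");
        sketch.startMassIndexing();
        sketch.onEntityChange("book#2");
        sketch.onEntityChange("author#7");
        System.out.println("Pending during mass indexing: " + sketch.pendingCount());
        sketch.finishMassIndexing();
        System.out.println("Pending after mass indexing: " + sketch.pendingCount());
        System.out.println("Processed: " + sketch.processedEvents());
    }
}
```

The point of the sketch is that changes made during mass indexing are never dropped: they wait in the queue and are replayed in order once processing resumes.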

That being said, schema changes are still an open problem. It’s unfortunately not as simple as notifying of problems caused by changed fields; there could also be changes in how reindexing must happen.

For example, if a Book entity has no @IndexedEmbedded on its author property, then changing an author won’t trigger any event to reindex the corresponding books. But if you change your mapping to add @IndexedEmbedded to Book#author, and someone modifies an author while you’re reindexing, there’s a chance you’ll simply have a missing event…
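The missing-event problem can be illustrated with a small plain-Java sketch. This is a deliberately simplified model, not how Hibernate Search resolves reindexing: the class `MissingEventSketch`, the method `eventsFor`, and the string `"Book#author"` used to represent an embedded association are all hypothetical.

```java
import java.util.List;
import java.util.Set;

// Hypothetical toy model of reindexing resolution; NOT Hibernate Search's implementation.
class MissingEventSketch {

    // Returns the reindexing events produced when an entity of the given type changes,
    // given the set of associations that the mapping declares as embedded.
    static List<String> eventsFor(String changedType, Set<String> embeddedAssociations) {
        if (changedType.equals("Book")) {
            // A change to an indexed entity always reindexes that entity.
            return List.of("reindex Book");
        }
        if (changedType.equals("Author") && embeddedAssociations.contains("Book#author")) {
            // Only with the embedded association does an author change reach the book.
            return List.of("reindex Book");
        }
        // Without the embedded association, nothing is recorded for this change.
        return List.of();
    }

    public static void main(String[] args) {
        Set<String> oldMapping = Set.of();              // no @IndexedEmbedded on Book#author
        Set<String> newMapping = Set.of("Book#author"); // @IndexedEmbedded added

        // The running application still uses the old mapping while the new index is built.
        System.out.println("Old mapping, author changed: " + eventsFor("Author", oldMapping));
        System.out.println("New mapping, author changed: " + eventsFor("Author", newMapping));
    }
}
```

If the author change happens while the old mapping is still in effect, no event is queued at all, so there is nothing to replay against the new index afterwards.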

So, yeah, it’s not simple. We’re working on it, but it’s not going to be solved in one day. Relevant tickets: [HSEARCH-3499] - Hibernate JIRA, [HSEARCH-2861] - Hibernate JIRA (and we’ll probably need to work on more than that).

I still think the outbox-polling coordination strategy would be a good idea in your case, though.

I tested your solution on my machine, and it works like a charm. I did it in a multi-tenant environment (with only one tenant defined), and the architecture comes at basically no cost. I tried the mass indexer's dropAndCreateSchemaOnStart(true) option and clearly saw the outbox table holding the events until automatic indexing could run. We don’t use a multi-node environment, so we don’t need that scalability. So yes, the next problem is schema changes (analyzer/tokenizer of a @FullTextField, field type, etc.), which show up as changes in the output of these Elasticsearch URLs: {{elserver}}/myindex-read/_mapping and {{elserver}}/myindex-read/_settings.
For the moment, the only solution is to warn the operations team to perform a full reindexing of the affected index when that happens (which implies setting the strategy to NONE so the application can start up before the full reindexing is performed on the fly). But most of the time, the CREATE_OR_VALIDATE strategy is sufficient, and outbox-polling then allows us to start the application gracefully and reindex the existing documents, with possibly added/removed field(s), while the application is running.
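For readers who want to reproduce this setup, here is a rough sketch of the two settings involved, written as a plain Java map. It assumes that NONE and CREATE_OR_VALIDATE above refer to the schema management strategy, and that the properties are passed to Hibernate ORM the usual way (persistence.xml, framework configuration, etc.). The class `SearchSettingsSketch` and its `settings` method are hypothetical helpers for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that only assembles configuration properties;
// it does not start Hibernate Search.
class SearchSettingsSketch {

    // Builds the settings for an application start.
    // incompatibleSchemaChange: true when the index schema changed in a way that
    // validation would reject, so a full reindexing is planned after startup.
    static Map<String, String> settings(boolean incompatibleSchemaChange) {
        Map<String, String> props = new LinkedHashMap<>();
        // Queue entity change events in the database instead of indexing directly.
        props.put("hibernate.search.coordination.strategy", "outbox-polling");
        // Skip schema checks at startup when a full reindexing will follow;
        // otherwise create missing indexes and validate existing ones.
        props.put("hibernate.search.schema_management.strategy",
                incompatibleSchemaChange ? "none" : "create-or-validate");
        return props;
    }

    public static void main(String[] args) {
        settings(false).forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

In the `none` case, the schema would then have to be recreated by the full reindexing itself, for example through the mass indexer's drop-and-create option mentioned above.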

Thank you very much.
