Need to limit auto reindexing when only excluded properties change

We have huge indexed transaction tables that embed indexes of reference tables, and the indexes for some of the most frequently used reference tables also contain embedded indexes which change continuously. There is little overlap between the embedded properties in the reference tables, which change frequently, and those embedded in the transaction tables, which do not.

In Hibernate Search 5, there appears to be no optimization or detection of this use case in place. As a result, a change to an index embedded in a reference table spirals off into a huge and unnecessary amount of automatic reindexing of the transaction tables and their respective embedded indexes that produce no useful changes. It becomes painfully slow to the point we have to limit our use of Hibernate Search in order to have a modicum of productive performance.

In Hibernate Search 5 there seems to be an assumption that a change to any indexed property must reindex every @ContainedIn association to an infinite depth up the chain, even though the developer can filter properties via includePaths on the other side, such that the change that triggered the refresh of the embedded index actually has no effect on a given container.

I am hoping Hibernate Search 6 will begin using includePaths of @IndexedEmbedded intelligently, to avoid traversing an association during automatic reindexing, if there will be no useful change made. I would like to confirm this has been implemented in the next release.

What I am looking for could be described as a @ContainedIn annotation at the property level, with a list of the inverse associations to which it applies, instead of having a global @ContainedIn on the inverse association applied to every property (included or not).

The same could be accomplished by using the includePaths of @IndexedEmbedded on the forward association to identify when we can safely stop traversing the hierarchical tree. I understand this is a complex dependency calculation, but the performance gained by implementing this will far outweigh the development effort.

The denormalization and duplication of all needed relations into a single index, to avoid having to fully implement a “join” feature between indexes in the underlying search engines, has caused the need for this highly pessimistic refresh behavior, but presently it is too pessimistic. (I do believe that, from a technical standpoint, it must be possible to implement an index join feature in Lucene, but that is another topic.)

I see that Hibernate Search 6 is starting to automatically determine the inverse side of @IndexedEmbedded without a @ContainedIn, and that more indexing can be triggered through @IndexingDependency, which is nice, but we also need new ways to achieve less automatic reindexing across @IndexedEmbedded/@ContainedIn pathways.

Please let me know if what we are looking for will be or is already implemented, or if an enhancement request is needed for Hibernate Search 6. Thank you!

I think you may know what follows already, but just to check we’re on the same page, and for anyone else reading this…

Are you talking about thousands of entities referencing the same “reference entity instance”, where the “reference entity instance” is sometimes changed in a way that requires reindexing? If so, you should definitely stay away from automatic reindexing for this particular association. Hibernate Search currently loads all entities pointed to by associations when it reindexes, and doesn’t clear the session at any point (because it can’t). So such an association may take ages to process, and require a significant amount of memory, since it will load a lot of entities.

The only way to handle this currently would be to ignore this association when it comes to automatic reindexing, and reindex affected entities overnight when you know a reference entity has changed during the last 24h. It’s not currently handled by Hibernate Search itself (though we’ve been considering it), so you’ll have to implement the mechanism to detect changes during the last 24h by hand, then trigger the mass indexer.
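For anyone wondering what "detect changes by hand, then trigger the mass indexer" could look like, here is a minimal sketch in plain Java. Everything in it is an assumption made for illustration: the ReferenceChangeLog class, its method names, and the in-memory map are hypothetical, and the Runnable only stands in for whatever mass indexing call your application uses. A real implementation would most likely persist the change timestamps in the database.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper: remembers when each reference entity was last changed.
final class ReferenceChangeLog {
    private final Map<Long, Instant> lastChange = new ConcurrentHashMap<>();

    // Call this from the code path that updates a reference entity.
    void recordChange(long referenceId, Instant when) {
        // Keep only the most recent change per entity ID.
        lastChange.merge(referenceId, when,
                (previous, latest) -> latest.isAfter(previous) ? latest : previous);
    }

    // IDs of reference entities changed at or after the cutoff.
    Set<Long> changedSince(Instant cutoff) {
        Set<Long> ids = new TreeSet<>();
        lastChange.forEach((id, when) -> {
            if (!when.isBefore(cutoff)) {
                ids.add(id);
            }
        });
        return ids;
    }

    // Nightly job: run the (application-provided) mass indexing step
    // only if something changed during the last 24 hours.
    boolean nightlyRun(Instant now, Runnable massIndexer) {
        if (changedSince(now.minus(Duration.ofHours(24))).isEmpty()) {
            return false;
        }
        massIndexer.run();
        return true;
    }

    public static void main(String[] args) {
        ReferenceChangeLog log = new ReferenceChangeLog();
        Instant now = Instant.parse("2020-01-02T02:00:00Z");
        log.recordChange(7L, now.minus(Duration.ofHours(3)));   // recent change
        log.recordChange(9L, now.minus(Duration.ofHours(30)));  // too old to matter
        System.out.println(log.changedSince(now.minus(Duration.ofHours(24)))); // prints [7]
        log.nightlyRun(now, () -> System.out.println("mass indexing triggered"));
    }
}
```

The point of the design is that the user's transaction only records a timestamp, while the expensive reloading of thousands of entities happens once, outside business hours.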

Right. There are optimizations in place in Hibernate Search 5, but with two limitations:

  • As soon as you use a @ClassBridge on an entity, all optimizations get disabled, because Hibernate Search is unable to determine which properties are used to build the indexed document.
  • The list of properties that trigger reindexing is “absolute”. If A index-embeds B and both A and B are indexed, then there may be some properties that are only used in the indexed form of B. Hibernate Search will ignore that and will reindex both A and B when these properties change.

I confirm this has been implemented in Hibernate Search 6. It’s been quite some time, but the optimization is definitely present in Hibernate Search 6.0.0.Beta8.

In short, Hibernate Search 6 uses a “richer” metadata tree when it comes to reindexing. Instead of having a single list of properties triggering reindexing for each type, it has (for each type) a tree representing associations that may need to be processed for reindexing. Some nodes in this tree hold a list of properties that need to have changed in order to trigger reindexing of this “branch” of the tree.
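To make the idea of such a tree more concrete, here is a toy model in plain Java. It is only a sketch of the concept, not the actual Hibernate Search 6 data structure: the ReindexTreeSketch and Node classes and the Category/Product/OrderLine example are all hypothetical. Each node represents a containing type reached through an association and holds the properties of the changed entity that this container actually uses; a branch is pruned as soon as none of those properties changed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy model of a per-type reindexing tree (hypothetical, for illustration only).
final class ReindexTreeSketch {

    // One node per association that may have to be followed when reindexing.
    static final class Node {
        final String containerType;          // type that embeds the changed entity
        final Set<String> triggerProperties; // properties that must have changed
        final List<Node> children = new ArrayList<>();

        Node(String containerType, Set<String> triggerProperties) {
            this.containerType = containerType;
            this.triggerProperties = triggerProperties;
        }

        Node child(Node node) {
            children.add(node);
            return this;
        }

        void collect(Set<String> changed, List<String> out) {
            boolean relevant = triggerProperties.stream().anyMatch(changed::contains);
            if (!relevant) {
                return; // prune: nothing further up this branch uses the change
            }
            out.add(containerType);
            for (Node c : children) {
                c.collect(changed, out);
            }
        }
    }

    // Types to reindex after the given properties changed on the root entity.
    static List<String> containersToReindex(List<Node> branches, Set<String> changed) {
        List<String> out = new ArrayList<>();
        for (Node branch : branches) {
            branch.collect(changed, out);
        }
        return out;
    }

    // Tree for a hypothetical Category entity: Product embeds category.name,
    // and OrderLine embeds Product but nothing that comes from Category.
    static List<Node> categoryBranches() {
        Node orderLine = new Node("OrderLine", Set.of());
        Node product = new Node("Product", Set.of("name")).child(orderLine);
        return List.of(product);
    }

    public static void main(String[] args) {
        System.out.println(containersToReindex(categoryBranches(), Set.of("description"))); // prints []
        System.out.println(containersToReindex(categoryBranches(), Set.of("name")));        // prints [Product]
    }
}
```

In this model, a change to a property nobody embeds reindexes no container at all, and a change to name stops at Product instead of propagating to every OrderLine.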

This need should be covered:

  • As mentioned just above, there are now optimizations in place to only reindex when we really need to.
  • Type bridges (the new name of “class bridges”) now provide ways to designate the properties they rely on, thereby allowing optimizations to stay enabled.
  • In cases where automatic, in-session reindexing is not realistic (due to thousands of entities needing to be reindexed), you can mark an association as “ignored for automatic reindexing” using @IndexingDependency(reindexOnUpdate = ReindexOnUpdate.NO). See this part of the documentation.
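For readers who have not seen the annotation yet, a mapping using it might look roughly like the fragment below. The TransactionRecord and ReferenceCode entities and their fields are made up for this example. The imports assume Hibernate Search 6.0 with javax.persistence; package names may differ in the Beta releases and in later versions that moved to jakarta.persistence. It is a mapping fragment, so it needs Hibernate ORM and Hibernate Search on the classpath and does nothing on its own.

```java
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.ManyToOne;

import org.hibernate.search.mapper.pojo.automaticindexing.ReindexOnUpdate;
import org.hibernate.search.mapper.pojo.mapping.definition.annotation.FullTextField;
import org.hibernate.search.mapper.pojo.mapping.definition.annotation.Indexed;
import org.hibernate.search.mapper.pojo.mapping.definition.annotation.IndexedEmbedded;
import org.hibernate.search.mapper.pojo.mapping.definition.annotation.IndexingDependency;

// Hypothetical high-cardinality "transaction" entity.
@Entity
@Indexed
class TransactionRecord {
    @Id
    private Long id;

    // Only "label" is embedded, and changes to the referenced entity
    // never trigger automatic reindexing of TransactionRecord.
    @ManyToOne
    @IndexedEmbedded(includePaths = "label")
    @IndexingDependency(reindexOnUpdate = ReindexOnUpdate.NO)
    private ReferenceCode referenceCode;
}

// Hypothetical low-cardinality "reference" entity.
@Entity
class ReferenceCode {
    @Id
    private Long id;

    @FullTextField
    private String label;

    // Not embedded anywhere; changing it should not affect any index.
    private String internalNotes;
}
```

With such a mapping, the index of TransactionRecord can become stale when a label changes, so it has to be refreshed some other way, for example with a periodic mass indexing run.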

As far as I can tell, what you’re asking for is already there.

There is of course the matter of implementing “fully asynchronous” reindexing, where associations are not accessed in the same session as the one used to modify entities, but instead entity IDs are pushed to a queue, and entities are later loaded in a background process for reindexing. This could be a first step towards handling high-cardinality associations such as the association to a “reference entity” you mentioned.
This new feature is currently filed as HSEARCH-2364, but we won’t take the time to implement it in Search 6.0.0.Final. I hope that we’ll do that in 6.1. If you end up needing to implement something like this in your own application, we could definitely discuss how you can implement it directly in Hibernate Search 6 instead, and if possible get it released in the next betas. If you’re interested, drop by on our chat!
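As a rough illustration of the "push IDs to a queue, load later" idea, here is a small sketch in plain Java. It is not how HSEARCH-2364 is or will be implemented; the ReindexQueueSketch class and its methods are hypothetical, and an in-memory queue is used only to keep the example short. A real solution would need a persistent queue that survives restarts and respects transaction boundaries.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical in-memory queue of entity IDs waiting to be reindexed.
final class ReindexQueueSketch {
    private final BlockingQueue<Long> pending = new LinkedBlockingQueue<>();

    // Called from the user's transaction: cheap, no association is loaded.
    void enqueue(long entityId) {
        pending.add(entityId);
    }

    // Called by a background worker: takes up to maxIds queued IDs and
    // removes duplicates so each entity is loaded and reindexed only once.
    Set<Long> drainBatch(int maxIds) {
        List<Long> drained = new ArrayList<>();
        pending.drainTo(drained, maxIds);
        return new LinkedHashSet<>(drained);
    }

    int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        ReindexQueueSketch queue = new ReindexQueueSketch();
        queue.enqueue(1L);
        queue.enqueue(2L);
        queue.enqueue(1L); // same entity modified twice
        queue.enqueue(3L);
        System.out.println(queue.drainBatch(10));  // prints [1, 2, 3]
        System.out.println(queue.pendingCount());  // prints 0
    }
}
```

The background worker would then load the drained IDs in its own session, in batches, and clear that session between batches, which is exactly what in-session automatic reindexing cannot do.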

We have 340 entities in our data model, and 10-20 million records (and growing) is typical. For the sake of simplicity, assume that one-third of these are reference tables with very low cardinality, one-third of them are transaction tables with very high cardinality, and one-third of them are master tables with medium cardinality, and the target of a high number of foreign keys. There is usually a path from a given table to any other table in the system, if enough relationships are traversed. Rarely is any table on an island.

In my example, there are reference tables R1 and R2, where R2 changes constantly. R1 includes R2 property R2f2. Transaction table T1 includes property R1f1 from R1. T1 also includes T2, A and B. If reference table R2 changes field R2f1, R1 is reindexed as expected. However, Hibernate Search 5 also reindexes T1 because R1 was reindexed, even though property R2f1 is not included in index T1. Then it reads T2, A and B from the database to reindex T1, even though they have not changed either. This slows down reindexing dramatically, when only R1 and R2 needed reindexing. Indexes for T1, T2, A and B were refreshed but did not actually change.

This is a simplified example; our real-world application is actually worse than this. Luckily we have our database on enterprise solid state storage!

If the list of properties contains every property, this is the root cause of our performance issues. Take this a step further and introduce entities C and D embedded in A. In the case where B changes, A, C and D are all reindexed into A even though neither C nor D changed at all. It gets even more unruly when B changes and the changed property isn’t included in any index, yet reindexing is triggered anyway…

It would be nice if Lucene could be instructed to replace the value of only changed properties during reindexing while leaving the rest alone. Then the attributes representing properties from C and D could remain the same, to avoid reading those records from the database, while the changed properties of B are refreshed in A.

Excellent! This will likely solve our present issue. Presently, the automatic indexing is taking a highly unreasonable pathway through almost all of our interconnected entities, and fetching very inefficiently. Batch size does not seem to be considered in Hibernate Search 5, as we’re seeing records being read from the database one at a time, even though thousands of records of the same type need to be retrieved. I will be able to test as soon as we finish porting all of our code to the new API.

This will be huge. Making the end user of the application wait for indexing during their data entry transaction is something we desperately need to get away from to increase throughput and decrease wait times and database contention. Thank you for the invitation to discuss via chat. I will definitely discuss this possibility with our clients and stay in touch. I am also interested in HSEARCH-3281 because it would allow us to offload indexing to another server.

It would be terrific to be able to specify which indexes have a higher priority than others. Some entities need to be more realtime, because they are related to transaction processing, while others are indexes of historical data that need to be maintained, but are not as time sensitive.

Thank you!

Right. Don’t get me wrong, in Search 6 we’ll still need to load C and D when we actually need to reindex A. But at least we will only reindex A when we actually need to.

That should only happen when optimization is disabled, which (IIRC) only happens if you rely on class bridges. In Search 6, the new TypeBinder allows you to specify which properties you depend on.

It would be interesting indeed, but rather complex, and I’m not even sure it’s entirely possible. As far as I know, Lucene only allows such “per-field updates” for doc-values, but not for indexes or storage. That means it would only work for fields designed to be sorted/aggregated on, but not searched on nor projected on. Related: HSEARCH-1236

That’s interesting, never heard of that. May be worth investigating, and if you can create a reproducer, opening a ticket?

Even without fully asynchronous indexing, in the nearer future, you might be interested in HSEARCH-168. It’s about disabling automatic indexing for specific indexes. Then you can plan mass indexing periodically (e.g. overnight) for these indexes. It’s currently planned for 6.0.0.Final, though I’m not very optimistic about it; we might have to delay it to 6.1.