Hello,
Hibernate Search 6 writes to the index from multiple threads, unlike Hibernate Search 5. I haven’t checked exactly what Lucene’s IndexWriter does in that case, but I suppose it’s possible that this strategy leads to more fragmented indexes, and thus longer merges.
63 minutes, though… that’s a lot.
So, let’s have a look.
Do you really need to merge?
First things first: check that you actually need to merge segments after mass indexing.
In the old days of Lucene, merging segments used to be something you just had to do. Nowadays, Lucene has very advanced automatic merging behaviors and will try to keep your index organized into an ideal number of segments, depending on the index size and the ratio of deleted documents in each segment. And it does this as you write to the index, so there’s usually no need to ask Lucene to merge segments after mass indexing: everything should already be merged into an appropriate number of segments.
By asking the mass indexer to merge segments on finish, you’re forcing Lucene to merge everything into a single segment, which is rarely a good idea, especially for very large indexes. More on this here.
So, I urge you to try not calling mergeSegmentsOnFinish(true) and see what happens.
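For example, a minimal mass indexing call without the merge step could look like this. This is just a sketch: Book is a hypothetical entity type, and entityManager is assumed to be your JPA EntityManager.

import org.hibernate.search.mapper.orm.Search;
import org.hibernate.search.mapper.orm.session.SearchSession;

// ...

SearchSession searchSession = Search.session( entityManager );
searchSession.massIndexer( Book.class )
        // no .mergeSegmentsOnFinish( true ): let Lucene merge segments on the fly
        .startAndWait(); // throws InterruptedException if the indexing gets interrupted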
Does the performance of your application stay the same, or improve, if you don’t merge segments after mass indexing? Then you don’t need to merge segments.
Does it decrease? Then maybe you still don’t need to call mergeSegmentsOnFinish(true), but should rather tweak the merge settings so that segments get merged the way you need them to, on the fly while writing instead of at the end of mass indexing. These settings apply to all indexing, not just mass indexing.
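For reference, the merge settings in Search 6 live under the backend’s io.merge.* prefix. Something along these lines, where the values are purely illustrative rather than recommendations, and, if I remember correctly, the sizes are in MB; check the reference documentation for the exact names and defaults:

spring.jpa.properties.hibernate.search.backend.io.merge.factor=10
spring.jpa.properties.hibernate.search.backend.io.merge.min_size=50
spring.jpa.properties.hibernate.search.backend.io.merge.max_size=1024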
Apply the same settings in 6 as you do in 5
I see these two settings in your Search 5 config:
spring.jpa.properties.hibernate.search.default.index_flush_interval=2000
spring.jpa.properties.hibernate.search.default.max_queue_length=2000
To get the exact same config in Search 6, you need to add this:
spring.jpa.properties.hibernate.search.backend.io.commit_interval=2000
spring.jpa.properties.hibernate.search.backend.indexing.queue_size=2000
See the migration guide for the new name of every single configuration property.
To get the exact same behavior as Search 5 (single queue, no parallel indexing), you can also try the following. However, I’d recommend trying this separately, as a second step, because it may decrease performance.
spring.jpa.properties.hibernate.search.backend.indexing.queue_count=1
More?
If the above didn’t solve your problems… you’ll need more information.
You can ask the mass indexer to only index the first N entities by calling limitIndexedObjectsTo(N). See mass indexer parameters. Try indexing a fraction of your data and see whether you get the same difference in execution time (proportionally). If so, you can experiment on that fraction of your data, which makes iterating much more practical.
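A sketch of what that could look like, reusing the SearchSession from the example above (100,000 is just an example value):

searchSession.massIndexer( Book.class )
        .limitIndexedObjectsTo( 100_000 ) // only index the first 100,000 entities
        .startAndWait();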
You can enable Lucene’s infostream to get a (very detailed) trace of what Lucene is doing exactly. This will help in finding out what’s going on during the merging, in particular. However, the infostream is very, very verbose… so just writing it to the logs may impact performance. Be careful when comparing execution times after you enable the infostream.
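If I remember correctly, enabling it takes a backend property plus TRACE-level logging on a dedicated logger category, roughly as below; do double-check the exact property name and category in the reference documentation:

spring.jpa.properties.hibernate.search.backend.io.writer.infostream=true
logging.level.org.hibernate.search.backend.lucene.infostream=TRACE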
Once you find out what’s different exactly, you can change some Lucene index writer settings or Lucene merge settings. If your goal really is to merge all segments into one (which, as I said above, probably isn’t a good idea), you’ll want to tune the merge settings so that Lucene doesn’t create too many segments while writing in the first place.
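The index writer counterparts live under io.writer.*; again, the values below are illustrative only, and if I remember correctly the RAM buffer size is in MB:

spring.jpa.properties.hibernate.search.backend.io.writer.ram_buffer_size=64
spring.jpa.properties.hibernate.search.backend.io.writer.max_buffered_docs=10000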