HS6 MassIndexer mergeSegmentsOnFinish

Is mergeSegmentsOnFinish in HS6 (6.0.1) performing the same work as optimizeOnFinish in HS5? I’m asking because I am seeing a very large performance delta (elapsed time). These tests were performed on the exact same postgresql-10 database using a Lucene backend.

HS5: total index rebuild time 2 hours, 8 mins, 27 secs @ ~500 docs/sec; optimize time ~18 mins.
HS6: total index rebuild time 2 hours, 59 mins, 25 secs @ ~472 docs/sec (just a little slower than HS5); mergeSegments time ~63 mins (this is the pain point).

Any thoughts would be appreciated as this is not the fastest thing to iteratively test :slight_smile:

Thanks, Keith

These are my settings:

HS5:
spring.jpa.properties.hibernate.search.default.exclusive_index_use = true
spring.jpa.properties.hibernate.search.default.worker.execution=sync
spring.jpa.properties.hibernate.search.default.index_flush_interval=2000
spring.jpa.properties.hibernate.search.default.max_queue_length=2000

  int numCores = Runtime.getRuntime().availableProcessors();
  int numThreads = Integer.max(numCores, 1);

  MassIndexer indexer = fullTextEntityManager.createIndexer(PatentDocDO.class, TpsMainDO.class)
     .typesToIndexInParallel(1)
     .batchSizeToLoadObjects(200) 
     .cacheMode(CacheMode.IGNORE) 
     .threadsToLoadObjects(numThreads)
     .idFetchSize(5000) 
     .progressMonitor(new SimpleIndexingProgressMonitor())
     .optimizeOnFinish(true);

HS6:

spring.jpa.properties.hibernate.search.automatic_indexing.synchronization.strategy=write-sync

  int numCores = Runtime.getRuntime().availableProcessors();
  int numThreads = Integer.max(numCores, 1);

  MassIndexer indexer = searchSession.massIndexer(PatentDocDO.class, TpsMainDO.class)
     .typesToIndexInParallel(1)
     .batchSizeToLoadObjects(200) 
     .cacheMode(CacheMode.IGNORE) 
     .threadsToLoadObjects(numThreads)
     .idFetchSize(5000) 
     .mergeSegmentsOnFinish(true);

Hello,

Hibernate Search 6 writes to the index from multiple threads, contrary to Hibernate Search 5. I haven’t checked the details of what Lucene’s IndexWriter does in that case exactly, but I suppose it’s possible that this strategy leads to more fragmented indexes, and thus longer merges.

63 minutes, though… that’s a lot.

So, let’s have a look.

Do you really need to merge?

First things first: check that you actually need to merge segments after mass indexing.

In the old days of Lucene, merging segments used to be something you just had to do. Nowadays, Lucene has very advanced automatic merging behaviors and will try to split your index into an ideal number of segments, depending on the index size and the ratio of deleted documents for each segment. And it will do this as you write to the index, so there’s usually no need to ask Lucene to merge segments after mass indexing: they should already be merged in an appropriate number of segments.

By asking the mass indexer to merge segments on finish, you’re forcing Lucene to merge everything into a single segment, which is rarely a good idea, especially for very large indexes. More on this here.

So, I urge you to try not calling mergeSegmentsOnFinish(true) and see what happens.
Does the performance of your application stay the same, or improve, if you don’t merge segments after mass indexing? Then you don’t need to merge segments.
Does it decrease? Then maybe you still don’t need to call mergeSegmentsOnFinish(true), but should rather tweak the merge settings so that segments are merged the way you need, on the fly while writing instead of at the end of mass indexing. These settings apply to all indexing, not just mass indexing.
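For reference, here is a sketch of the merge-related knobs of the Lucene backend (property names from the backend configuration reference; the values below are purely illustrative, not recommendations, and the sizes should be in megabytes):

hibernate.search.backend.io.merge.factor=10
hibernate.search.backend.io.merge.min_size=50
hibernate.search.backend.io.merge.max_size=1024
hibernate.search.backend.io.merge.max_forced_size=1024

With Spring Boot, prefix these with spring.jpa.properties. as in your other settings.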

Apply the same settings in 6 as you do in 5

I see these two settings in your Search 5 config:

spring.jpa.properties.hibernate.search.default.index_flush_interval=2000
spring.jpa.properties.hibernate.search.default.max_queue_length=2000

To get the exact same config in Search 6, you need to add this:

spring.jpa.properties.hibernate.search.backend.io.commit_interval=2000
spring.jpa.properties.hibernate.search.backend.indexing.queue_size=2000

See the migration guide for the new name of every single configuration property.

To get the exact same behavior as Search 5 (single queue, no parallel indexing), you can also try the following. However, I’d recommend trying this separately, as a second step, because it may decrease performance.

spring.jpa.properties.hibernate.search.backend.indexing.queue_count=1

More?

If the above didn’t solve your problems… you’ll need more information.

You can ask the mass indexer to only index the first N entities by calling limitIndexedObjectsTo(N). See mass indexer parameters. Try indexing a fraction of your data, see if you’re getting the same difference in execution time (proportionally). If so, you can experiment with this fraction of your data, and then iterating becomes feasible.
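For example, a minimal sketch reusing your entity types (the limit value here is arbitrary):

  MassIndexer indexer = searchSession.massIndexer(PatentDocDO.class, TpsMainDO.class)
     .limitIndexedObjectsTo(100_000) // only load and index the first 100,000 entities
     .mergeSegmentsOnFinish(true);   // keep whatever setting you are trying to measure
  indexer.startAndWait();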

You can enable Lucene’s infostream to get a (very detailed) trace of what Lucene is doing exactly. This will help in finding out what’s going on during the merging, in particular. However, the infostream is very, very verbose… so just writing it to the logs may impact performance. Be careful when comparing execution times after you have enabled the infostream.
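With the Lucene backend, something like this should do it (a sketch: the infostream is only actually written when the corresponding logger is set to TRACE, so check the reference documentation for the exact logger category):

hibernate.search.backend.io.writer.infostream=true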

Once you find out what’s different exactly, you can change some Lucene index writer settings or Lucene merge settings. If your goal really is to merge all segments into one (which, as I said above, probably isn’t a good idea), you’ll probably want to tune the merge settings to make sure that Lucene doesn’t create too many segments while writing.
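As a starting point, these are the writer-related properties of the Lucene backend (again, the names come from the backend configuration reference and the values are illustrative only):

hibernate.search.backend.io.writer.max_buffered_docs=10000
hibernate.search.backend.io.writer.ram_buffer_size=64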

Update… applying my HS5 settings (as mapped above) results in worse performance overall: 3 hours, 11 minutes and 25 seconds, with a merge time of 75 minutes and 28 seconds.

Using spring.jpa.properties.hibernate.search.backend.indexing.queue_count=1 crushes overall performance by 30%.

Not merging segments on finish (using default settings) results in a 25% hit to search times.

Will run more tests and reply later.

After a number of tests, I think I have settled on setting mergeSegmentsOnFinish to false and going with

hibernate.search.backend.io.merge.factor=2

Interestingly, this does not seem to impact MassIndexer performance by a significant amount. In fact, in many runs of 500K docs, the indexer has completed more quickly than with the default value of 10. The number of .doc files in the index is reduced from 10 to 7 when running with a merge factor of 2. Frankly, I thought the delta would be bigger, but from everything I’ve read, using a low merge factor should result in better read (search) performance over time, and I can see some improvement over the default settings. Search performance is not as good as with merge on finish (which does result in a single .doc file in the index), but that advantage is apparently only temporary. I’m going to kick off a full run on the staging database I have with 3.3 million docs (production actually has 12 million) and see how that behaves. Settings for the run, in addition to the merge factor above:

  int numCores = Runtime.getRuntime().availableProcessors();
  int numThreads = Integer.max(numCores, 1); 

  MassIndexer indexer = searchSession.massIndexer(PatentDocDO.class, TpsMainDO.class)
     .typesToIndexInParallel(1)
     .batchSizeToLoadObjects(25) //10 is the default setting
     .cacheMode(CacheMode.IGNORE) 
     .threadsToLoadObjects(numThreads) //Should be at least number of cores
     .idFetchSize(150) //Tried setting this to 0 to cause load all in postgres - degraded performance a lot.
     .dropAndCreateSchemaOnStart(true) // No searches possible during reindex as schema dropped, but upgrade guaranteed
     .mergeSegmentsOnFinish(false); //This is the default, but explicitly noting that this is turned off

OK, this thing truly baffles me. I’ve run the MassIndexer 3 times (back to back) with the same settings (mergeSegmentsOnFinish=false) on the same database. I am setting one property

hibernate.search.backend.io.merge.factor=2

This is the lowest merge factor Lucene allows, and it definitely impacts throughput (ignore my previous comment on this). With a merge factor of 10, I am getting 486 docs/sec… compare that with what is below. Also, the overall duration at merge factor 10, even including the merge at the end, still comes out to only 25 minutes 21 seconds, which is slightly better than what I am seeing below.

Run 1: 27mins, 31 secs @304 docs/sec
23 .si files; 8 .cfs files; 15 .doc files

Run 2: 28 min, 36 secs @293 docs/sec
19 .si files; 8 .cfs files; 11 .doc files

Run 3: 26 min, 21 secs @317 docs/sec
26 .si files; 13 .cfs files; 13 .doc files

I do not understand this inconsistency. Note to anyone reading this… when the job reports that it is complete, it is not actually done. Check top/iostat and you will see it keep working for 5 or so minutes (remaining background merges, I’m assuming). If anyone understands why I end up with a different number of segments from the same data, I would love an explanation.

If the former approach cannot be made more predictable, then my plan is to run with mergeSegmentsOnFinish, but I don’t want just a single segment, as it would likely be a 32+ GB file in production. I set this

hibernate.search.backend.io.merge.max_forced_size=500

as an experiment on 500K docs and it appears to work.
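As a rough sanity check (assuming max_forced_size is in megabytes): a forced merge on a ~32 GB index with segments capped at 500 MB should leave on the order of 32,768 / 500 ≈ 65 segments, rather than one huge file.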

In this way you can control the initial number of segments you have to work with after a re-index, which you can schedule as needed. Are there any gotchas with this approach that I need to be aware of?

Thanks in advance.

Most likely, yes. Merges happen in a dedicated thread, spawned every time Lucene needs to perform a merge.

I’d assume merges are still in progress? Wait until the threads with “Lucene Merge Thread” in their name disappear, then you should have the expected amount of segments.
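If you want to check this programmatically, a quick-and-dirty sketch (not an official API, just inspecting thread names as described above):

  // Returns true while background Lucene merge threads appear to be running.
  static boolean luceneMergesInProgress() {
      return Thread.getAllStackTraces().keySet().stream()
          .anyMatch(t -> t.isAlive() && t.getName().contains("Lucene Merge Thread"));
  }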

Note that the application is usable before merges are complete. You can search, write to the index, … Anything. There’s just some background merge in progress, but that could happen at any time in the normal operation of your application, not just after mass indexing. You can totally consider mass indexing complete before the (background) merges complete.

I don’t have much hands-on experience with these settings, but this sounds fine to me.

If you need more useful insight, you can also ask the Lucene community: Apache Lucene - Lucene™ Mailing Lists and IRC

Regarding throughput… be aware that quite often, the bottleneck for mass indexing is loading entities, not writing them to the Lucene index. That’s because indexing a single entity often involves loading multiple associations on top of the “root” entity.

So if, later, you want to improve mass indexing speed, you might want to look into optimizing the loading of entities, rather than the Lucene configuration. You can find some general information here: Hibernate Search 7.0.0.Final: Reference Documentation. Another thing that may help is enabling batch fetching of collections.
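For instance, with Spring Boot and Hibernate ORM, batch fetching can be enabled globally with a single property (a sketch; the batch size is just an example value):

spring.jpa.properties.hibernate.default_batch_fetch_size=20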