Out of memory exception when creating the initial index with the mass indexer

We have started a POC to implement Hibernate Search in our current transaction banking project.
The main entity used for data persistence is quite heavy, with multiple associations.
A find on this entity generates a select query with 828 fields in the select clause and 30 tables in the from clause.
The table contains 8.5 million records.

Now, when I try to create the initial index using the mass indexer, it results in an out of memory exception, even with 2 GB of heap space allocated to the application.

It's not even able to index 5000 records.

fullTextEntityManager.createIndexer(ReceivableInstructionMain.class)
        .batchSizeToLoadObjects(5)
        //.cacheMode(CacheMode.NORMAL)
        .threadsToLoadObjects(5)
        .idFetchSize(15)
        .transactionTimeout(1800)
        .limitIndexedObjectsTo(5000)
        //.progressMonitor(monitor) // a MassIndexerProgressMonitor implementation
        .startAndWait();

It seems that documents are not flushed to the Lucene index until the whole process has completed.

How can I create the initial index with this much data?

The mass indexer is known to work with large datasets; that’s its main purpose. So the problem is probably in the mass indexing settings, in the ORM mapping (a sub-optimal configuration), or in the Search mapping itself, if it recursively indexes associated entities and generates enormous documents.

From the settings you gave, Hibernate Search should only load at most 5 * 5 = 25 entities in memory simultaneously. It is strange that such a small number of entities would trigger an OOM exception… So my guess would be a problem in either the ORM mapping or the Search mapping.

Some questions:

  1. Which version of Hibernate ORM are you using?
  2. Which version of Hibernate Search are you using?
  3. When you say there are 828 fields, you mean when fetching only the data of this entity? Or when fetching this entity and associated entities? I’m asking because one entity may spread over multiple tables, but 30 seems a lot.
  4. Exactly how large are your entities? Can you give a rough estimate of how many objects would have to be loaded in memory to access all the indexed properties of an entity? If it’s a million, then we have a problem, but if it’s only a few dozen then you should be fine, especially with your current settings.
  5. Do you use eager loading on some associations? Do you have to, or could you switch to lazy associations with batch fetching in at least some cases (using Hibernate ORM’s @BatchSize annotation or setting hibernate.default_batch_fetch_size)? See the sketch after this list.
  6. What do you index: only some properties, or basically everything?
  7. How extensively do you use @IndexedEmbedded? Do you use @IndexedEmbedded.includePaths to limit embedding to only what’s necessary?
  8. Do you have unusually large data in your entities, such as CLOBs/BLOBs?
  9. Did you try to enable logging for org.hibernate.search.batchindexing.impl.SimpleIndexingProgressMonitor, so as to display the progress in logs before it crashes? Can you give us the result?

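Regarding question 5, here is a minimal sketch of what I mean by lazy associations with batch fetching. ReceivableInstructionMain is the entity name from your code; SomeDetails and the field names are made up for illustration:

import javax.persistence.Entity;
import javax.persistence.FetchType;
import javax.persistence.Id;
import javax.persistence.OneToOne;

import org.hibernate.annotations.BatchSize;
import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Indexed;

@Entity
@Indexed
public class ReceivableInstructionMain {

    @Id
    private Long id;

    @Field
    private String reference; // hypothetical indexed property

    // LAZY instead of the default EAGER for @OneToOne.
    // Note: the side that holds the foreign key can be made lazy out of the
    // box; a mappedBy side generally needs bytecode enhancement to be lazy.
    @OneToOne(fetch = FetchType.LAZY)
    private SomeDetails details; // hypothetical associated entity

    // getters/setters omitted
}

// In its own file:
@Entity
@BatchSize(size = 20) // load up to 20 instances per query when lazily fetched
public class SomeDetails {

    @Id
    private Long id;
}

Alternatively, setting hibernate.default_batch_fetch_size in your persistence configuration applies batch fetching to all lazy associations without per-entity annotations.
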
Could you show us your entity model? It would be much easier to spot problems that way. From this mass indexer execution code alone, I can only tell you that the problem doesn’t seem to be there.

Hi yrodiere,

Please find below the answers to your questions:

Which version of Hibernate ORM are you using? - hibernate-core-5.2.14.Final
Which version of Hibernate Search are you using? - hibernate-search-orm-5.9.1.Final
When you say there are 828 fields, do you mean when fetching only the data of this entity, or when fetching this entity and associated entities? - Fetching data from this entity and its associated entities.
Exactly how large are your entities? How many objects have to be loaded in memory to access all the indexed properties of an entity? - 30 tables in the from clause of the query means 30 objects, not millions.
Do you use eager loading on some associations? - All one-to-one associations use eager loading.
What do you index: only some properties, or basically everything? - To start with, we are indexing only a few properties of the main entity.
How extensively do you use @IndexedEmbedded? - To start with, @IndexedEmbedded is not used.
Do you have unusually large data in your entities, such as CLOBs/BLOBs? - No, there are no CLOBs/BLOBs.
Did you try to enable logging for org.hibernate.search.batchindexing.impl.SimpleIndexingProgressMonitor? - Progress was 19% when it crashed, with 10000 records set for indexing.

No document was flushed to the index directory when the OOM exception occurred.

When the Java process was killed, the documents got flushed.

It may have, if you had to-many associations or if you were using @IndexedEmbedded. But since you have neither, this is off the table. Each entity should indeed load a 30-object graph at most.

Now, that’s really weird. Using more than 2GB of heap space for 5 * 5 = 25 entities loaded in parallel, each of them loading a graph of at most 30 entities… There must be a leak somewhere.

Do you do anything else in this JVM? You are not using an in-memory database like H2 for your tests, right?

I will not be available next week. In the meantime, here are some suggestions:

Inspect the heap

I would recommend using Java Mission Control to determine how exactly your heap ends up that large. In our own performance tests and in other real-world applications with fairly complex entity models, 2GB was always more than enough… There must be something specific going on.
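
For example, standard JVM options can capture a heap dump right when the OOM happens, which you can then open in Java Mission Control or a similar tool (the paths and jar name below are placeholders):

# Capture a heap dump automatically on OutOfMemoryError
java -Xmx2g \
     -XX:+HeapDumpOnOutOfMemoryError \
     -XX:HeapDumpPath=/tmp/massindexer.hprof \
     -jar your-application.jar

# Or take a dump of a running process (replace <pid> with the Java process id)
jmap -dump:live,format=b,file=/tmp/massindexer.hprof <pid>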

Set up a reproducer

If you still need help, a reproducer will probably be necessary. If the problem is in Hibernate Search itself, just copy/pasting your mapping into a very simple project and running the mass indexer should allow you to reproduce the problem.

The main issue will be to fill the database with enough data before running mass indexing… You could start with your own database and see whether mass indexing still fails. Then try to populate another database with a few thousand automatically generated objects, and see if it still happens.
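
If you go down that route, here is a rough sketch of populating a test database with generated objects, assuming a plain JPA setup (the persistence unit name is a placeholder, and the generated values are up to you):

import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Persistence;

public class PopulateTestData {

    public static void main(String[] args) {
        EntityManagerFactory emf =
                Persistence.createEntityManagerFactory("reproducer"); // placeholder unit name
        EntityManager em = emf.createEntityManager();
        em.getTransaction().begin();
        for (int i = 0; i < 10_000; i++) {
            ReceivableInstructionMain entity = new ReceivableInstructionMain();
            // ... set generated values on the entity and its associations ...
            em.persist(entity);
            if (i % 50 == 0) {
                // flush and clear periodically so the persistence context stays small
                em.flush();
                em.clear();
            }
        }
        em.getTransaction().commit();
        em.close();
        emf.close();
    }
}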

If you decide to try to build a reproducer, this test case template might help: hibernate-test-case-templates/search/hibernate-search-lucene at master · yrodiere/hibernate-test-case-templates · GitHub

I have managed to index 16000 records and it is working fine now. I will try with the 8.5 million records as well.

It started working when the configuration below was used:
fullTextEntityManager.createIndexer(ReceivableInstructionMain.class)
        .batchSizeToLoadObjects(5)
        .threadsToLoadObjects(5)
        .idFetchSize(15)
        .transactionTimeout(1800)
        //.limitIndexedObjectsTo(3000)
        .start();

Also, the Eclipse console buffer was set to limited (it was previously unlimited) and show_sql was turned off.
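
For reference, this is the Hibernate property involved (shown here as it would appear in persistence.xml; adjust to however you configure Hibernate):

<property name="hibernate.show_sql" value="false"/>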

Hi yrodiere,

As I mentioned previously, the entity we are indexing is a complex one. A find on this entity generates a select query with 828 fields in the select clause and 30 tables in the from clause.

When we search for records in the indexed data, a find query is generated to fetch the actual data from the DB, and it specifies all 828 fields in the select clause, but I only need 25 fields to be fetched.

Is there any way to specify which columns should be fetched from the DB for the records matched in the indexed data, instead of running the full find query?
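
Something like the following is what I am after (a hypothetical sketch using Hibernate Search 5’s projection API; as far as I understand, the projected fields would need to be stored in the index with Store.YES, and the field names here are made up):

import java.util.List;

import org.apache.lucene.search.Query;
import org.hibernate.search.jpa.FullTextEntityManager;
import org.hibernate.search.jpa.FullTextQuery;

// given a FullTextEntityManager and a Lucene query in scope:
FullTextQuery query = fullTextEntityManager.createFullTextQuery(
        luceneQuery, ReceivableInstructionMain.class);

// read the needed values straight from the index instead of loading
// full entities from the DB; each hit is an Object[] with one element
// per projected field
query.setProjection("field1", "field2" /* , ... up to the 25 fields */);

List<Object[]> results = query.getResultList();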