Issues indexing large dataset with mass indexer

We have used the mass indexer for a long time now with no issues. We are using the Elasticsearch implementation. We are in the process of scaling our data, and have started to notice some issues with the mass indexer. Before the holidays we had around 15 million records, which would index in around 5 hours. Over the holidays we pulled in an additional 10 million records, and now the mass indexer causes out-of-memory exceptions and other issues.

From stepping through the debugger, it appears that it is loading all primary keys (unique 8-12 character strings) into memory. By my understanding this should only be about 250 MB, but JConsole shows almost 3 GB of memory used. It takes about 10 minutes to fetch all of these, with a lot of GC runs in between. After that it starts indexing with the small amount of memory left. If we run it in our cloud environment it fails to index anything, and usually just crashes after 10-20 minutes with an OOM exception.

Is there any way we can get around having to load all of the primary keys into memory? This will only become more of a problem as we scale further.

The call to the MassIndexer, for reference:

fullTextEntitymanager.createIndexer(Object.class)
        .typesToIndexInParallel(1)
        .batchSizeToLoadObjects(75)
        .cacheMode(CacheMode.NORMAL)
        .threadsToLoadObjects(32)
        .idFetchSize(150)
        .progressMonitor(indexMonitor)
        .startAndWait();

We have also tried setting all of the parameters to 1, but the same issue occurred.

Hibernate: 5.4.24.Final
Hibernate Search: 5.10.4.Final

Hello,

The MassIndexer definitely doesn’t try to load every ID first and then process them: it does both in parallel. Identifiers are not all loaded at the same time, but rather in batches of 150 IDs, with a limit of 1000 batches in flight. So, at any point in time, you won’t have more than 150,000 IDs in memory. Also, batches of IDs are processed as they are fetched, in a parallel thread.
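In sketch form, the pattern looks roughly like this (a simplified illustration of bounded producer/consumer batching, not the actual Hibernate Search code; all names and row counts are made up):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class IdBatchingSketch {

    public static void main(String[] args) throws InterruptedException {
        // Bounded queue: at most 1000 batches (i.e. 150,000 IDs) in memory at once.
        BlockingQueue<List<String>> idBatches = new ArrayBlockingQueue<>(1000);

        // Producer: scrolls over the ID query, emitting batches of 150 IDs.
        Thread producer = new Thread(() -> {
            List<String> batch = new ArrayList<>(150);
            for (int i = 0; i < 1_000_000; i++) { // stand-in for the scrolled ResultSet
                batch.add("id-" + i);
                if (batch.size() == 150) {
                    putQuietly(idBatches, batch); // blocks once 1000 batches are queued
                    batch = new ArrayList<>(150);
                }
            }
            if (!batch.isEmpty()) {
                putQuietly(idBatches, batch); // flush the final partial batch
            }
            putQuietly(idBatches, Collections.emptyList()); // end-of-stream marker
        });

        // Consumer: loads and indexes the entities for each batch as it arrives,
        // in parallel with the producer still fetching further IDs.
        Thread consumer = new Thread(() -> {
            try {
                List<String> batch;
                while (!(batch = idBatches.take()).isEmpty()) {
                    // load entities by primary key and index them here
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
    }

    private static void putQuietly(BlockingQueue<List<String>> q, List<String> batch) {
        try {
            q.put(batch);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}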

I’m not sure where your OOM exception comes from, but I’d be surprised if it was related to IDs.

The fact that you’re still getting an OOM with all parameters set to 1 is interesting… Obviously that’s not a useful configuration as it will be super slow, but it definitely should have a low memory footprint.

Could there be a problem with your mapping? Some eager or @IndexedEmbedded association that, in your new data, happens to point to thousands of entities, and thus triggers a lot of database operations even though just one entity is being indexed?
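For example, a mapping along these lines (hypothetical; Parent and Child are made-up names, and the Child entity is elided) would force loading a potentially huge collection every time a single Parent is indexed:

import java.util.List;
import javax.persistence.Entity;
import javax.persistence.FetchType;
import javax.persistence.Id;
import javax.persistence.OneToMany;
import org.hibernate.search.annotations.Indexed;
import org.hibernate.search.annotations.IndexedEmbedded;

@Entity
@Indexed
public class Parent {

    @Id
    private String id;

    // Eager + @IndexedEmbedded: indexing one Parent loads *all* of its
    // children; with thousands of children per parent, memory explodes.
    @OneToMany(fetch = FetchType.EAGER)
    @IndexedEmbedded
    private List<Child> children;
}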

I suggest you check the SQL queries being run, either through Hibernate ORM logs or the logs of your DBMS. If one single query is returning thousands of rows, that’s your problem right there.
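If it helps, SQL logging can be enabled with standard Hibernate ORM properties, for example set programmatically when bootstrapping (the persistence unit name here is a placeholder):

import java.util.HashMap;
import java.util.Map;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Persistence;

public class SqlLoggingBootstrap {
    public static void main(String[] args) {
        Map<String, Object> props = new HashMap<>();
        props.put("hibernate.show_sql", "true");   // print every SQL statement
        props.put("hibernate.format_sql", "true"); // pretty-print for readability
        // Alternatively, set the org.hibernate.SQL logger category to DEBUG
        // in your logging configuration instead of using show_sql.
        EntityManagerFactory emf =
                Persistence.createEntityManagerFactory("my-persistence-unit", props);
    }
}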

Another way to tackle the issue is to take a memory dump and use a specialized tool to try to infer what is taking up so much memory. But it will probably be hard to make sense of all this data.

The reason I mentioned loading them all into memory is that there is a query “select this_.object_id as y0_ from object_base this_” that is run within mysql-connector-java. This is the method that takes around 10 minutes, during which memory climbs to around 3 GB. It returns a ResultSet; within that is a ResultsetRowsStatic object, and within that is an ArrayList with 25 million objects. Each object appears to contain an 8-12 byte array, the values of which seem to be the decimal value of each character in the ID when using UTF-8 encoding. This always happens before any data is indexed, no matter the parameters set on the MassIndexer, and after this, indexing continues as normal, at least when run on my local machine. What I find interesting is this should only equate to roughly 250 MB of data, but either way it is concerning.
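A rough back-of-envelope (assuming a 64-bit HotSpot JVM with compressed oops; all per-object overheads are approximate) suggests most of the gap is object overhead rather than the ID bytes themselves:

public class RowOverheadEstimate {
    public static void main(String[] args) {
        long rows = 25_000_000L;
        // mysql-connector-java wraps each row in an object referencing a
        // byte[][], which holds one byte[] per column (here: just the ID).
        long innerByteArray = 32; // 16 B array header + ~12 B of ID data + padding
        long outerRowArray  = 24; // 16 B byte[][] header + one padded reference
        long rowWrapper     = 16; // row object header
        long listSlot       =  4; // compressed reference in the ArrayList
        long perRow = innerByteArray + outerRowArray + rowWrapper + listSlot; // ~76 B
        System.out.printf("~%.1f GB%n", rows * perRow / 1e9); // prints ~1.9 GB
    }
}

Add GC headroom for 25 million short-lived allocations on top of that, and the observed ~3 GB looks plausible.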

As for the mapping, it’s possible. The mapping isn’t great (old code we have taken over), and the data we pulled over Christmas was several years old, so it could have caused issues not seen in the newer data. I can take a look at the SQL and go from there.

Thanks once again for your speedy help on these issues, really appreciate the time and effort :slight_smile:

Ah… MySQL and fetch size. A classic.

So, the thing is, we’re using scrolling to gather the IDs. We run a single query, but fetch rows in pages of 150 elements. With most databases and JDBC drivers, this works fine.

But… not with MySQL. The MySQL JDBC driver does not let you choose how many rows to fetch. You can set the JDBC fetch size to 1000, 150 or 1; it does not matter: it will always pre-fetch all rows immediately, which is what crashes your application.
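At the plain JDBC level (outside of Hibernate Search), MySQL Connector/J’s escape hatch looks like this; a sketch, with the URL, credentials and query as placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class MySqlStreamingSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                        "jdbc:mysql://localhost/mydb", "user", "password");
                Statement stmt = conn.createStatement(
                        ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            // By default Connector/J buffers the entire result set regardless of
            // any positive fetch size (unless useCursorFetch=true is set).
            // Integer.MIN_VALUE switches it to row-by-row streaming instead.
            stmt.setFetchSize(Integer.MIN_VALUE);
            try (ResultSet rs = stmt.executeQuery(
                    "select object_id from object_base")) {
                while (rs.next()) {
                    String id = rs.getString(1); // one row at a time, O(1) memory
                }
            }
        }
    }
}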

The javadoc of idFetchSize actually mentions this problem and offers the solution:

	/**
	 * Specifies the fetch size to be used when loading primary keys
	 * of objects to be indexed. Some databases accept special values,
	 * for example MySQL might benefit from using {@link Integer#MIN_VALUE},
	 * otherwise it will attempt to preload everything in memory.
	 * @param idFetchSize the fetch size to be used when loading primary keys
	 * @return {@code this} for method chaining
	 */
	MassIndexer idFetchSize(int idFetchSize);

So, try this:

fullTextEntitymanager.createIndexer(Object.class)
        .typesToIndexInParallel(1)
        .batchSizeToLoadObjects(75)
        .cacheMode(CacheMode.NORMAL)
        .threadsToLoadObjects(32)
        .idFetchSize(Integer.MIN_VALUE) // <= HERE
        .progressMonitor(indexMonitor)
        .startAndWait();

You have once again saved the day! Sorry for not seeing that in the javadoc, and thanks for the quick help! As an added benefit, we are now building the index 50-100% quicker because of the memory that has been freed up!


No problem, glad I could help :slight_smile: