I am doing some tests to process millions of records with Hibernate using streams to avoid memory exhaustion.
In the test case below I am trying to process 20 million records.
var sql = "select\r\n" //
+ " generate_series as id,\r\n" //
+ " 'abcdefghijklmnopqrstuvwxyz 0123456789' as code,\r\n" //
+ " 'abcdefghijklmnopqrstuvwxyz 0123456789' as description,\r\n" //
+ " 'abcdefghijklmnopqrstuvwxyz 0123456789' as tags,\r\n" //
+ " generate_series as version\r\n" //
+ "from\r\n" //
+ " pg_catalog.generate_series(1, 20000000)";
var em = emf.createEntityManager();
try {
em.getTransaction().begin();
var query = em //
.createNativeQuery(sql, GroupEntity.class) //
.setHint(QueryHints.HINT_FETCH_SIZE, "1000");
query.getResultStream().forEach(o -> {
// do something
});
em.getTransaction().commit();
} finally {
em.close();
}
With this setup, it consumes all available memory and raises an OOM error (I have 4GB of RAM available).
I presume it is because Hibernate is caching all entities in the first level cache.
If I create the query without assigning the result to an entity, it works as expected and the memory consumption is almost constant (100MB).
var query = em //
.createNativeQuery(sql) //
.setHint(QueryHints.HINT_FETCH_SIZE, "1000");
I presume it is because as each result is not tied to an entity, Hibernate doesn’t cache anything.
But I want to get the entity instance instead of an Object[].
So, I used a workaround that solved it.
I manually detach each instance from the persistence context.
This way, the memory consumption is almost constant as well (100MB)
query.getResultStream().forEach(o -> {
// do something
em.detach(o);
});
Is there some recommended (standard) way to acomplish what I am trying to do? (read millions of entities without using too much memory).
It took me some time to find this solution, and it looks like a hack to me.
I presume many people will face the same problem when working with getResultStream().
If there is no way to disable first level caching when working with streams, it seems almost useless to use streams at all, as all entities will be stored in memory anyway.
Best regards,
Fabiano