Processing millions of records with query.getResultStream()

I am doing some tests to process millions of records with Hibernate using streams to avoid memory exhaustion.
In the test case below I am trying to process 20 million records.

		var sql = "select\r\n" //
				+ "  generate_series as id,\r\n" //
				+ "  'abcdefghijklmnopqrstuvwxyz 0123456789' as code,\r\n" //
				+ "  'abcdefghijklmnopqrstuvwxyz 0123456789' as description,\r\n" //
				+ "  'abcdefghijklmnopqrstuvwxyz 0123456789' as tags,\r\n" //
				+ "  generate_series as version\r\n" //
				+ "from\r\n" //
				+ "  pg_catalog.generate_series(1, 20000000)";

		var em = emf.createEntityManager();
		try {
			em.getTransaction().begin();

			var query = em //
					.createNativeQuery(sql, GroupEntity.class) //
					.setHint(QueryHints.HINT_FETCH_SIZE, "1000");
			query.getResultStream().forEach(o -> {
				// do something
			});

			em.getTransaction().commit();
		} finally {
			em.close();
		}
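
For reference, GroupEntity is a plain mapped entity along these lines (a hypothetical sketch matching the columns selected above; the actual class is not part of the post):

import javax.persistence.Entity;
import javax.persistence.Id;

@Entity
public class GroupEntity {

	@Id
	private Long id;
	private String code;
	private String description;
	private String tags;
	private Long version;

	// getters and setters omitted
}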

With this setup, it consumes all available memory and throws an OutOfMemoryError (I have 4GB of RAM available).

I presume it is because Hibernate is caching all entities in the first level cache.
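
One way to check that presumption is to unwrap the Hibernate Session and watch the persistence context grow while the stream is consumed (a rough sketch; the counter array and the reporting interval are just for illustration):

var session = em.unwrap(org.hibernate.Session.class);
int[] seen = new int[] { 0 };
query.getResultStream().forEach(o -> {
	// report the size of the persistence context every 100000 rows
	if (++seen[0] % 100000 == 0) {
		System.out.println(session.getStatistics().getEntityCount()
				+ " entities held in the persistence context");
	}
});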

If I create the query without assigning the result to an entity, it works as expected and the memory consumption is almost constant (100MB).

			var query = em //
					.createNativeQuery(sql) //
					.setHint(QueryHints.HINT_FETCH_SIZE, "1000");

I presume this is because, since each result is not mapped to an entity, Hibernate doesn’t cache anything in the persistence context.

But I want to get the entity instance instead of an Object[].

So I used a workaround that solves the problem:
I manually detach each instance from the persistence context.
This way, the memory consumption is almost constant as well (100MB).

			query.getResultStream().forEach(o -> {
				// do something
				em.detach(o);
			});

Is there some recommended (standard) way to accomplish what I am trying to do (read millions of entities without using too much memory)?

It took me some time to find this solution, and it looks like a hack to me.
I presume many people will face the same problem when working with getResultStream().

If there is no way to disable first-level caching when working with streams, it seems almost pointless to use streams at all, since all entities will be kept in memory anyway.

Best regards,

Fabiano

If you select scalar rows (no collection joins), it would be better to clear the Session/EntityManager after every X results. Assuming your driver responds appropriately to the 1000-row fetch hint, X=1000 would be a good starting point:

MutableInteger i = new MutableInteger();
query.getResultStream().forEach(
        (o) -> {
            // do something

            if ( i.incrementAndGet() > 1000 ) {
                em.clear();
            }
        }
);

NOTE: I use a custom MutableInteger rather than Java’s AtomicInteger because AtomicInteger carries atomic-operation overhead that is completely unnecessary here.
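
Such a MutableInteger is trivial to write; a minimal sketch of what it could look like (hypothetical, just a plain counter, not the exact class I use):

// A plain, non-thread-safe counter: no volatile or CAS overhead,
// which is fine since the stream is consumed on a single thread.
public class MutableInteger {

	private int value;

	public int incrementAndGet() {
		return ++value;
	}

	public int get() {
		return value;
	}
}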

But other than that, yeah, that would be the gist.

Thanks for your response.

Yes, I think this is the way to go.

And I presume you meant to clear the cache after every X results (as written, your example starts clearing only after the first 1000 results and then clears on every subsequent result).

PS: I usually create a single-element array when I need something effectively final yet mutable to use inside a lambda.

int[] count = new int[] { 0 };
query.getResultStream().forEach(o -> {
	// do something
	if (++count[0] % 1000 == 0)
		em.clear();
});