Potential memory leak in QueryInterpretationCache

Hi,

Recently our Hibernate version was updated to 6.2.7.Final. The settings hibernate.query.plan_cache_enabled, hibernate.query.plan_cache_max_size and hibernate.query.plan_parameter_metadata_max_size were not specified, so I assume the defaults were in effect.
We observed higher memory usage leading to OOMs in the production environment. Heap dump analysis indicated that a lot of objects associated with QueryInterpretationCache weren’t garbage collected (screenshot #1).

E.g. you can see almost 400k LIRSHashEntry objects consuming an estimated 2 GB of heap. There’s also a similar number of AliasToBeanResultTransformer and SelectInterpretationsKey instances.

Further analysis showed that the number of top-level LIRSHashEntry objects (those held in the table fields of the Segment class) was indeed below the default bound (2048). However, many of them were the head of a linked list that chained thousands of entries via the next field (screenshot #2).

All of the linked entries had the same hash and contained exactly the same query. Their keys were SelectInterpretationsKey instances. Their state field was set to HIR_NONRESIDENT, which I believe indicates that the entry was evicted from BoundedConcurrentHashMap. Most of the 400k entries have a similar state (screenshot #3).

I couldn’t reproduce this behaviour in a local environment, so maybe it’s linked to the higher volume and variety of queries in the production app.

I’d be grateful for any suggestions on what’s happening here and how we can make the cache respect the entries limit.

I think you can track HHH-13345 in the Hibernate JIRA for updates on this matter.

We ran into the same issue:

  • OOM error in live
  • Hibernate version 6.5.3.Final
  • memory dump shows BoundedConcurrentHashMap using 57% of the memory
  • seemingly endless nested LIRSHashEntry objects for the same native SQL query
  • state field was set to HIR_NONRESIDENT

We don’t have a max size set, but the number of LIRSHashEntry objects (80k) far exceeds the default of 2048.

The hash code is identical across the cache entries. That is true for the hash of the key (org.hibernate.query.sql.spi.SelectInterpretationsKey) as well as the hash of the cached entry itself (org.hibernate.internal.util.collections.BoundedConcurrentHashMap.LIRSHashEntry).

A new query plan was added to the cache on each execution. We observed this while debugging the following call stack. The root cause is that the equals method evaluates to false between the new key and the existing cache key:

org.hibernate.query.sql.internal.NativeQueryImpl#doList
org.hibernate.query.sql.internal.NativeQueryImpl#resolveSelectQueryPlan
org.hibernate.query.internal.QueryInterpretationCacheStandardImpl#resolveSelectQueryPlan
org.hibernate.internal.util.collections.BoundedConcurrentHashMap#get
org.hibernate.internal.util.collections.BoundedConcurrentHashMap.Segment#get
org.hibernate.query.sql.spi.SelectInterpretationsKey#equals
org.hibernate.transform.AliasToBeanResultTransformer#equals

The AliasToBeanResultTransformer equals method compares the aliases:

return resultClass.equals( that.resultClass )
			&& Arrays.equals( aliases, that.aliases );

The problem is that aliases is null on the new key, while that.aliases on the existing key is not.
The aliases are initialised in a call to transformTuple(…), which Hibernate makes after the key has been added to the cache. This changes the result of the equals method after the cache entry has been added.
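
The effect of a key whose equals result changes after insertion can be reproduced without Hibernate. The sketch below is a simplified illustration, not Hibernate's code: LazyAliasTransformer is a hypothetical stand-in for AliasToBeanResultTransformer, a plain HashMap stands in for the plan cache, and the hashCode that ignores the aliases is an assumption chosen to mirror the identical hashes seen in the heap dump.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for AliasToBeanResultTransformer: aliases start out
// null and are filled in on first use, yet they take part in equals().
final class LazyAliasTransformer {
    private final Class<?> resultClass;
    private String[] aliases; // null until transformTuple() has run once

    LazyAliasTransformer(Class<?> resultClass) {
        this.resultClass = resultClass;
    }

    void transformTuple(String[] tupleAliases) {
        if (aliases == null) {
            aliases = tupleAliases.clone();
        }
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof LazyAliasTransformer)) return false;
        LazyAliasTransformer that = (LazyAliasTransformer) o;
        return resultClass.equals(that.resultClass)
                && Arrays.equals(aliases, that.aliases);
    }

    @Override
    public int hashCode() {
        // Assumption for this sketch: the hash ignores the aliases, so all
        // keys for the same query share one bucket.
        return resultClass.hashCode();
    }
}

final class MutableKeyDemo {
    // Simulates repeated executions of the same query against a map-based cache.
    static int cacheSizeAfter(int executions) {
        Map<LazyAliasTransformer, String> planCache = new HashMap<>();
        for (int i = 0; i < executions; i++) {
            LazyAliasTransformer key = new LazyAliasTransformer(Object.class);
            // The lookup happens while the new key's aliases are still null.
            if (!planCache.containsKey(key)) {
                planCache.put(key, "plan");
            }
            // Running the query initialises the aliases and so mutates the
            // key that is already stored in the map.
            key.transformTuple(new String[] {"id", "name"});
        }
        return planCache.size();
    }

    public static void main(String[] args) {
        System.out.println(cacheSizeAfter(5));
    }
}
```

Calling MutableKeyDemo.cacheSizeAfter(5) returns 5: every execution misses the cache and adds another entry under the same hash, which matches the long chains of identical entries described above.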

Our long-term solution is to migrate away from AliasToBeanResultTransformer and use org.hibernate.query.spi.AbstractQuery#setTupleTransformer instead.

As a short-term fix, we tried setting query.setQueryPlanCacheable(false), but this option is ignored for native queries.
Instead, we extended AliasToBeanResultTransformer and overrode the equals method to ignore the aliases.
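
The shape of that workaround can be sketched roughly as follows. This is not our production code and it does not compile against Hibernate: AliasSensitiveTransformer is a hypothetical stand-in for AliasToBeanResultTransformer, and its protected fields are an assumption. If the real class does not expose the result class to subclasses, the subclass would have to keep its own reference to it.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for AliasToBeanResultTransformer (not the Hibernate class).
class AliasSensitiveTransformer {
    protected final Class<?> resultClass;
    protected String[] aliases; // filled in lazily by transformTuple()

    AliasSensitiveTransformer(Class<?> resultClass) {
        this.resultClass = resultClass;
    }

    void transformTuple(String[] tupleAliases) {
        if (aliases == null) {
            aliases = tupleAliases.clone();
        }
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof AliasSensitiveTransformer)) return false;
        AliasSensitiveTransformer that = (AliasSensitiveTransformer) o;
        return resultClass.equals(that.resultClass)
                && Arrays.equals(aliases, that.aliases);
    }

    @Override
    public int hashCode() {
        return resultClass.hashCode();
    }
}

// Workaround: equality depends only on state that never changes after construction.
final class CacheSafeTransformer extends AliasSensitiveTransformer {
    CacheSafeTransformer(Class<?> resultClass) {
        super(resultClass);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof CacheSafeTransformer
                && resultClass.equals(((CacheSafeTransformer) o).resultClass);
    }

    @Override
    public int hashCode() {
        return resultClass.hashCode();
    }
}

final class WorkaroundDemo {
    // Simulates repeated executions of the same query with the overridden equals().
    static int cacheSizeAfter(int executions) {
        Map<AliasSensitiveTransformer, String> planCache = new HashMap<>();
        for (int i = 0; i < executions; i++) {
            AliasSensitiveTransformer key = new CacheSafeTransformer(Object.class);
            planCache.putIfAbsent(key, "plan");
            // The aliases still get initialised, but no longer affect equality.
            key.transformTuple(new String[] {"id", "name"});
        }
        return planCache.size();
    }

    public static void main(String[] args) {
        System.out.println(cacheSizeAfter(5));
    }
}
```

With equality reduced to the result class, WorkaroundDemo.cacheSizeAfter(5) returns 1, because each new key matches the stored one. The trade-off is that two transformers for the same class but different aliases now compare as equal, which seemed acceptable to us since the SQL string is part of the cache key anyway.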

To me it seems there are three problems, which are probably connected:

  1. query plan cache growing until OOM crash
  2. cache allows adding duplicate entries for the same query
  3. QueryPlanCacheable flag ignored for native queries

@Sanne would this (and HHH-13345) be relevant to the “internal cache” improvements you’ve been working on?

Thanks for the detailed report. The problem is already known and tracked as HHH-19560.