Out of memory errors when indexing a large entity


#1

I have a large entity with lots of OneToMany, ManyToOne, and ManyToMany sets of child entities (each set with almost a thousand child entities).

It’s a HUGE entity, with over 10 GB of database data. EVERYTHING must be indexed (so using includePaths is pointless).

When I try to index this entity, I eventually get out-of-memory errors, as it tries to lazy-load all the thousands of child entities. After about 60 minutes, even after allocating 50 GB of RAM, I still get OOM errors.

Is there some way I can work around this, to index this massive graph of data? Is there some setting where I can tell the Hibernate Search indexer to immediately index the data, as it lazy-loads it, and discard it, to free up memory for the next entities in the lazy-load batch?


#2

There is no such feature in Hibernate Search yet.

Let’s take a step back. Are you indexing a single, huge entity, which would result in an index with a single, huge document?
That’s unusual and, more importantly, doesn’t seem very useful: full-text queries on that index would only return you either a single match or none at all.
If that’s really what you need, could you tell me a bit more about your model, the kind of queries you’re planning to run, the context… ? I might be able to find alternative solutions.

Back to the session clearing, there are two approaches to implementing this:

  1. Either we try to detect, before accessing an entity, whether it’s already loaded in the session or not, and if it’s not, we remember to evict it from the session after having extracted all the necessary data. That would probably be a bit fragile: if we evict an entity that was already in the session when the user started indexing, the user will end up getting strange errors. And since Session eviction can be cascaded, it’s not that easy to evict “just what we need”.
  2. Or we index in a separate session, and simply clear the session periodically, re-attaching entities as necessary after a clear.

There are plans to implement the second solution in Hibernate Search 6, but unfortunately that’s not going to be available for production environments anytime soon.
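
Until then, something close to the second approach can be done by hand with Hibernate Search 5’s manual indexing API, roughly following the pattern the reference documentation suggests for rebuilding an index manually. A sketch (the entity name "MyEntity" and the batch size of 100 are placeholders):

```java
// Sketch of approach 2: scroll through the entities in a dedicated session
// and clear it periodically, so already-indexed entities can be garbage-collected.
FullTextSession fullTextSession = Search.getFullTextSession(session);
fullTextSession.setFlushMode(FlushMode.MANUAL);
fullTextSession.setCacheMode(CacheMode.IGNORE);

ScrollableResults results = fullTextSession.createQuery("from MyEntity")
        .setFetchSize(100)
        .scroll(ScrollMode.FORWARD_ONLY);

int index = 0;
while (results.next()) {
    fullTextSession.index(results.get(0)); // queue the current entity for indexing
    if (++index % 100 == 0) {
        fullTextSession.flushToIndexes(); // write queued documents to the index
        fullTextSession.clear();          // detach everything loaded so far, freeing memory
    }
}
fullTextSession.flushToIndexes();
```

Note that each clear detaches every entity loaded so far, so any entities you still need afterwards have to be re-attached.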


#3

Thanks for the reply!

I’m not indexing a single entity; rather, I’m indexing hundreds of thousands of different entities that have embedded entities (OneToMany, ManyToOne, and ManyToMany) up to 4 levels deep.

Hibernate Search is able to index most of these entities (and their embedded children, grandchildren, great-grandchildren) just fine.

The problem is with larger entities, which can contain hundreds or even thousands of embedded children. In those cases, I simply run out of memory when initially indexing them.

I found a VERY ugly solution: I can reduce such an entity’s depth to just 1 and instead use includePaths to explicitly list hundreds of embedded fields.

This solution is nuts though. Surely there should be a way to specify something simple like
public class Parent {

    @IndexedEmbedded(depth = 1, prefix = "child_",
            includePaths = {"grandchild_*", "grandchild_greatgrandchild_*"})
    @OneToMany(mappedBy = "parent", fetch = FetchType.LAZY)
    private Set children;

}
instead of
public class Parent {

    @IndexedEmbedded(depth = 1, prefix = "child_",
            includePaths = {"grandchild_field1", "grandchild_field2", …, "grandchild_greatgrandchild_field1", "grandchild_greatgrandchild_field2", …})
    @OneToMany(mappedBy = "parent", fetch = FetchType.LAZY)
    private Set children;

}


#4

I don’t understand. Are you saying that the code you just gave is not equivalent to plain @IndexedEmbedded with no parameters? Have you tested it, and does it work?

So you don’t need to index everything after all, right?


#5

Sorry for the conflicting info.

By indexing everything, I mean mostly everything.

I have 6 primary entities that are related, up to depth=4 (some entities only need depths of 1, 2, or 3), through various OneToMany, ManyToOne, and ManyToMany relationships.

It would be unwieldy and unmaintainable to explicitly include all field paths, at all needed depths, for every field in every embedded entity.

It would take me days to populate all those includePaths string arrays. And if I change a field, I’d be forced to go on a safari, to update all the includePaths, in all the entities.

To avoid the OOM error, as a proof of concept, I was able to reduce the depths of the joined entities from depth=4 to depth=1, and manually specify SOME paths that need to go deeper. But doing this for possibly thousands of paths would be insane; it would take me days to calculate and populate the arrays.

If only includePaths allowed a way to specify child entities to index (without explicitly listing all the fields in those entities), it would solve my problem, as I’d be able to write things like:
includePaths={
"entity1",
"entity2.entity3",
"entity4.entity5.entity6"
}
instead of an array of thousands of string paths like
includePaths={
"entity1.field1", … "entity1.field999",
"entity2.entity3.field1", … "entity2.entity3.field999",
"entity4.entity5.entity6.field1", etc…
}


#6

Before we get ahead of ourselves, I’d like to mention two things:

  1. The fact that your proof of concept works does not mean that a full solution where all included paths are configured (either implicitly or explicitly) would work. Your current setup indexes much less data than what you want, and thus is much less prone to OOM.
  2. By default, when you don’t use includePaths, Hibernate Search will not access and index everything: it will only index the indexed fields (@Field) of the entities targeted by the @IndexedEmbedded. This means includePaths is only useful if these entities define indexed fields that you don’t want included. Is that your case?
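
To illustrate the second point with a hypothetical child entity:

```java
// Hypothetical child entity illustrating the default behavior:
// without includePaths, @IndexedEmbedded on the parent only pulls in
// the properties that are themselves mapped for indexing.
@Entity
public class Child {

    @Field                       // mapped: embedded into the parent's document
    private String name;

    private String internalNote; // not mapped with @Field: never embedded,
                                 // no includePaths needed to exclude it
}
```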

Now, back to trying to find a solution. What we know so far:

  • Your model includes thousands of fields.
  • The only way we could currently solve the OOM issue would be to reduce the number of embedded fields, as you rightly suggested.
  • Hibernate Search currently only allows reducing the number of embedded fields by specifying all of them explicitly.

The fact that you have so many fields makes me wonder: how do you intend to query the index? You will have to handle these thousands of fields at some point, and it would be as crazy to handle thousands of fields in the queries as in includePaths. I can see three ways that could work:

  1. You have some sort of representation of the field names, and your program uses that to build queries. In short, you have a metamodel.
  2. You never care about the fields and you just query every field most of the time.
  3. The users themselves know about the fields and give you the name of a field when they need to access it.

If you have a metamodel (case 1), you may be able to build your Hibernate Search mapping programmatically: write a piece of code that reads your metamodel, and uses Hibernate Search’s programmatic mapping APIs to translate the metamodel into a Hibernate Search mapping, instead of using annotations. Then listing all the included fields explicitly may be an option, provided your metamodel contains the necessary information.
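
As a rough illustration of that idea (the metamodel types and accessors here are entirely hypothetical, and the exact programmatic-mapping method names may vary between Hibernate Search 5.x versions):

```java
// Sketch: generating the Hibernate Search mapping from a metamodel at bootstrap,
// using the programmatic mapping API instead of annotations.
// MetaEntity, MetaAssociation and their accessors are hypothetical stand-ins
// for whatever representation of the fields you already have.
SearchMapping mapping = new SearchMapping();
for (MetaEntity meta : myMetamodel.getEntities()) {
    IndexedMapping indexed = mapping.entity(meta.getJavaType()).indexed();
    for (MetaAssociation assoc : meta.getAssociations()) {
        indexed.property(assoc.getPropertyName(), ElementType.FIELD)
                .indexEmbedded()
                .depth(assoc.getDepth())
                // still explicit paths, but generated from the metamodel
                // rather than hand-written:
                .includePaths(assoc.getIncludedPaths());
    }
}
// Hand the mapping to Hibernate before the SessionFactory is built:
configuration.getProperties().put("hibernate.search.model_mapping", mapping);
```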

If you never care about the fields and just query every field most of the time (case 2), there’s another option that could work: just give the same name to every field. Pick some neutral name like “text”, and stuff everything in it. Then your includePaths will be dead simple. You will have to take care to keep all the fields consistent, though (same type, same analyzer, …).
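
A minimal sketch of that second idea, with hypothetical entity and property names:

```java
// Sketch of case 2: several properties share a single index field name.
// "text" is an arbitrary name; every property mapped to it must stay
// consistent (same type, same analyzer, ...).
@Entity
@Indexed
public class Child {

    @Field(name = "text")
    private String title;

    @Field(name = "text")
    private String description;

    // With this scheme, includePaths stays short: "text", "children.text",
    // "children.grandchildren.text", and so on (assuming those property names).
}
```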

If only your users know about the field names, then it’s a bit more problematic… One last option you could investigate would be to look for ways to configure includePaths and depths so as to minimize both useless indexing and verbosity.
For example, you could first work on the depths at every level. Maybe you already did, but I’ll mention it just in case. Depths are composed “smartly” by Hibernate Search, so if your root entity specifies a depth of 4, but an entity at the level just below specifies a depth of 2, the resulting (absolute) depth will be 3 for that entity. Thus if you have a look at @IndexedEmbedded in your non-root entities and configure the depths appropriately, you might already gain some performance.
Similarly, includePaths are also composed “smartly”. You may use unrestricted depth/includePaths at the root, but use includePaths in entities at the level below: the filters will apply. If you have some entities that include cycles, or that only require to add very few fields to @IndexedEmbedded.includePaths, it would be a good idea to start with these entities first, and see how it improves the situation before you invest more time.


#7

Thank you for your detailed response. Your clarification about Hibernate Search’s “smart” indexing was particularly useful in my case.

I’ve analyzed the includePaths of one of my problematic root entities, and discovered places in some child entities where the depth can be reduced without compromising our users’ ability to run their queries.
That did the trick. No more OOM errors.
One final clarification: your example about “smart” indexing covered the case where a root entity had depth 4 but the child had depth 2, resulting in a composed depth of 3 from the root.
Does that apply in reverse? E.g. if the root entity had depth 2, and the immediate child entity had depth 2, would that result in an absolute depth of 3?


#8

Nice! Glad I could help.

The parent entity has precedence over the children, so this would result in an absolute depth of 2.
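
Putting the two examples from this thread together, the composition can be modeled as taking the minimum of the parent’s declared depth and the child’s level plus its own declared depth (a sketch of the rule, not actual Hibernate Search code):

```java
// Illustrative model of how @IndexedEmbedded depths compose:
// the absolute depth reached through a child is the child's level plus its own
// declared depth, but never more than what the parent already allows.
public class DepthComposition {

    static int composedDepth(int parentDepth, int childLevel, int childDepth) {
        return Math.min(parentDepth, childLevel + childDepth);
    }

    public static void main(String[] args) {
        // Root declares depth=4, child at level 1 declares depth=2 -> absolute depth 3
        System.out.println(composedDepth(4, 1, 2)); // 3
        // Root declares depth=2, child at level 1 declares depth=2 -> capped at 2
        System.out.println(composedDepth(2, 1, 2)); // 2
    }
}
```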