HS6 corrupting index after merge of detached entity with @OneToMany association

When a detached entity is passed to EntityManager.merge() in order to cause an update, HS6 no longer performs an automatic EntityManager.refresh() to load associations during indexing. After commit, the index is corrupted relative to its state before the update: the null association properties are treated as empty, which erases the good values that had been indexed previously.

There’s no call to refresh() in the Hibernate Search 5 codebase, so that must be something else…

Could you please open a JIRA ticket with a reproducer?

Yes, it’s the effect / concept of refresh() I wanted to convey, not the specific call. I used a refresh to work around the issue for one entity today, but I have dozens more which are still unfixed.

I don’t know how HS5 handled this specifically, all I can say is that it was working fine in HS5, and now our application is riddled with index corruption issues across the entire business system. Every merge is deleting valuable search data from the index until I can either work around it for every update, or get it fixed in the framework.

The way I modified our application to fix one of the most critical entities affected by this bug was to add a flush() to force the merge to the database, then a call to refresh(), and then a call to add the entity to the indexing plan again (this last step might not have been necessary).
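For reference, the per-entity workaround looks roughly like this (a sketch; MyEntity and the surrounding transaction handling are placeholders, and I’m using the HS6 SearchSession indexing-plan API for the explicit re-index step):

    import org.hibernate.search.mapper.orm.Search;
    import org.hibernate.search.mapper.orm.session.SearchSession;

    // Reattach the detached entity coming back from the client.
    MyEntity managed = entityManager.merge(detached);

    // Push the merged state to the database so the refresh below doesn't overwrite it.
    entityManager.flush();

    // Reload the managed instance so its null @OneToMany associations become live proxies.
    entityManager.refresh(managed);

    // Explicitly queue the entity for indexing again, now that associations can be traversed.
    SearchSession searchSession = Search.session(entityManager);
    searchSession.indexingPlan().addOrUpdate(managed);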

Here’s the use case. When entities are loaded and need to be sent over the network to a client application, only those attributes which are necessary for the view are populated. This would include all primitives, all ManyToOne properties, but almost no Collection properties can be “followed” during serialization (especially the large ones), otherwise the entire database would eventually be loaded into memory and sent over the wire. The unnecessary OneToMany collection properties are left null and the object is sent to the client.

When the client sends the entity back with changes, they are merged into the database correctly, but the indexing process goes no further when it finds the null OneToMany property values. None of the IndexedEmbedded relations are brought in unless they happen to be already loaded, or are found to be a “hot” proxy.

Our code is not refreshing every entity which is indexed, because Hibernate Search 5 didn’t require it. It was somehow doing its own refresh, or didn’t rely on the properties at all in order to bring in IndexedEmbedded collection data. Now indexing is so lean that unpopulated entities are being used as if they are already prepared for indexing, which of course they are not.

Okay, I will work on that.

Thanks. I understand, and it does look like a bug. The question is where does it come from, and that’s not obvious as there doesn’t seem to be any specific code in Search 5 to handle this. It may even be caused by another change in your application (ORM upgrade, use of newer bytecode enhancement features…)

So, I definitely need a reproducer.

Thanks a lot!

Yes, I agree, even a slight change to the merge behavior, such as not to swap in a proxy where it once did (hypothetically) during merge would be all it would take to mess up everything downstream. On the other hand, if this were the case, we’d probably be seeing a substantial increase in NullPointerExceptions anywhere we previously could rely on having a proxy.

We did move from Hibernate ORM 5.4.15 to version 5.5.6 when we integrated Hibernate Search 6. Javassist went from 3.24 to 3.27.

It seems like there have been times over the years when refresh() and find() would not provide live/hot proxies for initialization right after a persist or merge, without a prior call to evict() and/or refresh(). Then later we noticed that refresh() seemed to get smarter, i.e. it started doing a more thorough job of fixing up a detached instance.

At times I have wished that find(), persist() and merge() would always convert every null relation to a proxy, so I could remove a lot of boilerplate code, but then I got used to using evict() / refresh() or evict() and find() just about every time I would need to traverse a relationship next…

Of course the query cache and second level object cache all had their own respective roles to play in how this functionality would return results in a performant manner. This probably varied as well if we had changed configuration.

Hibernate is always getting better and better, so hopefully whatever change we need to make will stay the same for a while.

I’d hate to have to hit the database every time, just to get live proxies, but again, the second level cache will probably help that perform well if we need to do that to get the system back to normal, even in the short term.

I understand why this is happening now. In Hibernate Search 5 and prior, the gathering process for indexing used to happen very late, as the transaction was being committed. In Hibernate Search 6, there is a buffering mechanism that collects the data needed for document building much earlier, during flush.

The problem with this is that a merge + flush does not cause a refresh, and no proxies are added. Detached entities that have been updated must be reattached to the session with merge and refreshed before there are proxies in place to trigger the loading of additional data from the database while indexing is trying to traverse relationships.

However, if a flush occurs before the object is refreshed, then Hibernate Search 6 takes a copy too soon, and will not be able to traverse the tree properly. If a refresh is used before the flush, then obviously the changes would be overwritten.

Therefore, having changed the timing of when data is gathered for indexing, plus the fact that a buffer is now used instead of the entities themselves, Hibernate Search 6 has caused our application to stop indexing properly after a merge of a detached entity, because the snapshot taken does not yet contain proxies.

At the tail end of the transaction, our application refreshes all updated entities which started off as a detached entity coming in with changes from the client, so that the server can reinitialize any relationships needed by the client to their current values. This always happened before indexing in Hibernate Search 5, but with Hibernate Search 6 it now happens after a copy has already been cached into the new buffer, without proxies in place. Indexing can no longer take advantage of the additional proxies we add during this refresh, as it did before, unless a second copy of every refreshed entity is appended to the buffer manually by the application (which also causes duplicate, but at least correct, indexing).

This will end up being the solution for us to restore indexing of all updated entities: to add a second copy (a manual re-index) of every entity that is merged after it has been refreshed.

The disadvantage of this is that it costs another trip to the database that we shouldn’t need at all. After all, we already have the most current version of the entity, since we just merged it!

Thus far, I haven’t found a way to get in between the ORM flush and the HS6 buffering where I can safely add a refresh to proxy all associations before the first buffering attempt.

A refresh requires a prior flush, but indexing requires a prior refresh, so we have a chicken-and-egg scenario.

If Hibernate Search 6 is going to continue with this design, it would be nice if ORM could add the proxies for all null associations during merge (or if Hibernate Search would do the same before buffering). That way the proxies would be there to cache into the new buffer on the first flush, rather than requiring a second copy.

Hibernate Search 6 could also just load the associations the way ORM does when a proxy is triggered while the session is open, without having a proxy at all.

Otherwise, maybe the buffering could be moved back to the very end during commit, where it was before?

Definitely not by default, because that used to cause all kinds of problems in much more common scenarios. Maybe we could add an option to disable buffering on flush, but that’s one more setting that could affect potentially everything in automatic indexing. We already have hundreds of automatic indexing tests, so I’d rather not add so much maintenance burden.

So, let’s find another way.

Are we talking about a JPA refresh, or some manual operation you are doing?

To be honest, I’m a bit surprised that the entity passed to Hibernate Search by Hibernate ORM is, for all intents and purposes, a detached entity. Either there is something wrong in how ORM triggers events for merged entities, or there is something wrong with the data we’re getting passed (associations pointing to detached entities… somehow?).

Did you manage to create a reproducer?

Yes indeed, I reproduced the issue on both HS5 and HS6, which surprised me. I didn’t expect the issue to happen in HS5, but then found out everything I just shared with you about why it did… The test case templates were beautiful. I will share them with you later today.

Yes, we designed our interceptor to prevent exactly the problems you’re referring to. Our interceptor was not installed in either test case, so that opened up the issue to HS5 as well.

Yes, that’s exactly what they are, even after merge/flush. We have merge / flush / refresh cadences all over our application to make it possible for us to traverse relationships after a merge, to counteract this. It sounds like your buffering mechanism is also solving the same issue we are with our interceptor, for folks who didn’t write an interceptor and pre-process-and-proxy their entities for search and/or serialization.

We have a finely tuned routine that traverses the hierarchical tree to plug in fully functional objects from the session, or a proxy where a “dead” or null relationship exists; then, after the session is closed, it plucks out any proxies left over to prevent LazyInitializationExceptions during serialization.

They’re just null, and they stay that way after merge. The client doesn’t need them; some sets are huge, so we don’t load them or send them over the wire just for them to be sent back. But when the parent object is changed, these null collections must be proxied, initialized and indexed. These are just the OneToMany associations I’m working with here. We populate every ManyToOne on every object to a specified depth in the tree, so any object that the client can update can be sent back and merged correctly. We stop at a certain point which is programmable based on the complexity of the UI.

At the risk of stating the obvious, the merge operation returns an object, which in some (all?) cases can be different from the object you passed as an argument. That object should (IMO – I didn’t check the JPA spec) behave as any managed entity when it comes to lazy loading.

Do I understand correctly that you continue using the original (detached) object after the merge, not the actual managed entity returned by the merge operation? That could be the source of all your problems…

But still, I don’t understand why Hibernate Search itself ends up manipulating those detached objects… It gets the entities from Hibernate ORM, which obviously should pass actual managed entities.

The reproducer will definitely be interesting.

Yes, that’s the object I’m analyzing after the merge. The OneToMany property is null before the merge on the detached object, and after the merge on the object returned. The object is managed after the merge, but the OneToMany property is null. If you refresh the object, the property that was null lights up like a Christmas tree with a proxy, and that proxy returns an array with a nonzero length.

Here is the JIRA issue, with test cases pushed to GitHub: HSEARCH-4323

Thanks! I’ll have a look as soon as I can.

So, I had a look at the tests, and I understand better now. Thanks for that.

I’d like to come back to something you said:

Not exactly; the buffering is only there to allow people to perform a session.flush(); session.clear(): if we process entities on flush, the entities can safely be cleared and we will still be able to perform indexing on commit.
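To make that concrete, the pattern the buffering exists to support is the classic batch-processing loop, roughly like this (a generic sketch; MyEntity, setDescription and the id list are placeholders, not code from either of our projects):

    import java.util.List;

    // Generic batch-update loop.
    List<Long> idsToUpdate = List.of(1L, 2L, 3L);
    int batchSize = 50;
    int processed = 0;
    for (Long id : idsToUpdate) {
        MyEntity entity = entityManager.find(MyEntity.class, id);
        entity.setDescription("updated"); // dirty entity => automatic indexing
        if (++processed % batchSize == 0) {
            // Flush pending changes, then detach everything to keep memory bounded.
            // Because Hibernate Search collects the data it needs at flush time,
            // the clear() does not lose any index updates; they are applied on commit.
            entityManager.flush();
            entityManager.clear();
        }
    }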

So, the problems you are facing are different. They are problems we didn’t attempt to fix in Search 6, and quite frankly, problems I wasn’t aware of :slight_smile:

If I sum up, the three problems you are facing are:

  1. Session.merge returns an object whose associations are still set to null, and puts that object in the persistence context (which will affect Hibernate Search at some point). I’m not sure if it’s because the JPA spec mandates it (you did ask ORM to merge your object, whose association is set to null, after all) or because of a bug.
  2. Your code does not initialize both sides of associations, even when creating entities (see the sketch after this list).
  3. Your code sometimes creates entities whose associations point to detached entities; see in particular BITTestCase6 in your reproducer, where itemVendorInfo1 is persisted while still referencing a detached entity:
    Vendor vendor1 = new Vendor(1L, "Motor Company");
    Item item1 = new Item(1L, "Electric Motor");
    
    ItemVendorInfo itemVendorInfo1 = new ItemVendorInfo(1L, item1, vendor1, new BigDecimal("1000"));
    em.persist(itemVendorInfo1);
    
    // if this isn't done, the index will not see any of the elements that belong in the @OneToMany collection
    em.flush();
    

All these problems lead to detached, or at least out-of-sync entities being processed by Hibernate Search, which in turn leads to incorrect (incomplete) documents being indexed.

Evidently, all three of these problems were there before (in Search 5) too, which is why you had to add refresh operations just before the commit, so that the objects are back to what they should be.

We could consider fixes to remove the need for your workaround (either in Search or in ORM), but given the nature of what would have to be implemented, this will take a lot of discussion (to agree on what ORM/Search should support) and engineering (to find a solution that is reasonably backwards compatible, with as few side effects as possible, and that performs well). So this would be a long-term goal. What we need right now is a short-term solution.

So, let’s be clear, and let’s be constructive: the regression is the fact that your workaround (refreshes before commit) no longer works, and the immediate need is to make it work again, or to provide a way for you to implement an alternative, equivalent workaround.

I can see two quick ways out (feel free to suggest if you can see another one):

  1. Add a configuration option to disable processing for automatic indexing (= filling the internal buffer) on flush. Like I said, this cannot be the default, because it would break other use cases (loop with periodic flush + clear, in particular).
  2. Add a way to register hooks to be executed before we perform processing for automatic indexing (= before we fill the internal buffer). Then you could perform your refreshes in that hook.

Solution #1 is much (much) simpler to implement, but has the potential to impact lots of things, so it would be a nightmare to test properly. It would also mean that when you use this configuration option, you can’t use the “loop with periodic flush + clear” pattern.

Solution #2, while more complex, is much more localized in its impact and easier to test, and would not break the “loop with periodic flush + clear” pattern.

So personally, I’d rather work on solution #2.

Do you think that could work?

What would you need Hibernate Search to pass to your hook?

How would you need to register the hook: would configuration properties be fine?

Do you think that could perform adequately, considering how you perform flushes in your application?

…and so a LazyInitializationException can be avoided, as alluded to in the HS6 Migration Guide, right? This is what we were referring to. This is one function of our interceptor, which did for HS5 what the buffering now does for HS6.

I only mentioned this because the flush(); clear(); use case was still possible for us to implement in HS5, it just took a little more work to get it accomplished, and that work is contained in our interceptor. The fact that your implementation had the side-effect of being a breaking change for other use cases is unrelated to our interceptor, which is going to be irrelevant from the perspective of HS6. (It will still be useful for our serialization process.) No other connection to the issues we’re having was intended to be stated or implied, and none of our interceptor logic is found in the test cases.

This is the main problem that we both face. It’s a troublesome issue, one that we should mention to the Hibernate team somehow.

It has not posed much of a problem for us until now, as we simply used refresh(); to revitalize any objects returned in this half-baked state. flush(); is required prior to refresh(); so that new property values that were just merged do not get overwritten, otherwise we probably would not call flush() much.

However, I believe that flush(); can be triggered in other ways, such as by running certain hql queries.

So yes, this is how we have been compensating for the way ORM returns objects that get reattached to the Session / EntityManager. We currently know of no better way to completely reattach a detached object to the current Session, as if it had been read from it initially, with two lines of code: flush(); refresh();, and until now, have not had a reason to go looking for a better one.

I am not aware if ORM exposes an API which can turn an association into a HibernateProxy on demand. If it does, we could loop through all of our association properties after a persist or merge of a detached object, and do this on our own. It would be best if ORM did this internally. We would also be able to avoid an early flush();, because our call to refresh(); would no longer be necessary. More on this below…

That’s correct, because ORM has no need for us to do so. When the other side can be lazy loaded on demand (and could be huge), there has been no need to hard code this. The @OneToMany with mappedBy relationship is just a convenience mapping that automatically initiates a left join query to obtain a collection on demand; it isn’t a concept found in the database schema. Whether creating or merging, the other side is commonly going to be huge either way, and generally should not be initialized (unless it is indexed) just to add one more record to a list which will be found anyway by lazy loading.
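For reference, the kind of mapping I mean looks roughly like this (a sketch; only Item and ItemVendorInfo come from the reproducer, the field names and the @IndexedEmbedded placement are illustrative, and I’m assuming ItemVendorInfo has an item field for mappedBy):

    import java.util.ArrayList;
    import java.util.List;
    import javax.persistence.Entity;
    import javax.persistence.Id;
    import javax.persistence.OneToMany;
    import org.hibernate.search.mapper.pojo.mapping.definition.annotation.Indexed;
    import org.hibernate.search.mapper.pojo.mapping.definition.annotation.IndexedEmbedded;

    @Entity
    @Indexed
    public class Item {

        @Id
        private Long id;

        private String description;

        // The "other side": never populated by our code, left to lazy loading
        // (and to Hibernate Search, which traverses it for indexing).
        @OneToMany(mappedBy = "item")
        @IndexedEmbedded
        private List<ItemVendorInfo> itemVendorInfos = new ArrayList<>();
    }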

Absolutely, and this is no issue for ORM either. Almost all @ManyToOne properties are set on objects by the client side (Windows, macOS, iPhone, etc.), and sent back to the server to persist or merge. We rarely create entity objects in Java, except by deserialization off the wire.

These incoming objects were once loaded by a Session, were serialized and sent to the client, were then displayed by the client UI, and got sent back to the server, at which point they are, of course, detached. The fastest and most efficient thing to do on the server side at this point is to validate, persist or merge, and then refresh to obtain “armed” proxies for loading associations on demand so that the client can receive the latest hierarchical tree.

Trying to manage “the other side” of every @ManyToOne association would be a performance disaster: we would be initializing huge collections just to add one more entity to a list. Unless, of course, the collection is indexed, in which case HS can simply traverse the relationship, which will do the loading for us, without hard-coding. Most collections are lazy and are not sent to the client.

This would be excellent.

I hope I have always been clear and constructive: yes, this is one way to describe the situation. You would think that after a certain use case has been supported for more than 12 years, however, it might be eligible to be promoted from “workaround” status to a bona fide use case, no? :wink:

Hindsight is always 20/20. :grin:

With our use case, Hibernate Search 6 was not backwards compatible with previous releases, and you are right: it has never been able to cope with these behaviors that Hibernate really has, and has always had, for the 17 years we’ve been using it, and yet no unit test revealed the issue. We just found ways to cope and moved on without thinking much of it. The facilities and timings were there to make everything right in the end, and that is what mattered the most.

We do hope that more unit tests will be added that exercise Hibernate with our use case, with objects that have been reattached to the Session, so we will be protected in future releases.

Working with detached but still in sync entity data is a very common pattern in n-tier applications. All of our entities have @Version columns which make sure the client had the most up-to-date copy at the time we try to merge.

We agree, solution #2 sounds better than solution #1. Your choice is fine regarding whether it is a config property (persistence.xml) or an API to register an event listener.

First, I would like to go take a look at the APIs that Hibernate uses to proxy-ize objects, to see if I could use them. If I can, I would pass all objects coming in from clients through a routine that replaces all references with proxies, and then you wouldn’t have to make any changes at all.

We only use flush as a way to protect our merged data from the next refresh();, and we only use refresh(); so we can obtain proxies for lazy loading. We don’t care that much about flush(); by itself.

There was a time when we needed to use refresh(); after persist(); to obtain the database assigned @Id property value, so we could move forward with using the new object as a reference, rather than seeing a Hibernate exception because the @Id was still zero, which looks like an entity not saved.

I need to do some testing to see if this is still necessary in the latest ORM release with the JDBC drivers we use today. We use identity properties at the database level, and there was a time when the database dialect was not augmented to the extent of being aware of how to update the @Id after insert. We could also write a new dialect if needed.

Thank you for all the help, @yrodiere! Let’s keep brainstorming the best short-term approach, and then begin the process of achieving the long-term goals as well. I’ll be working on this until we arrive at a good stopping point.

You sure have. I should have said "I will try to be constructive" :slight_smile:

I meant it’s still a workaround because its reason for being is to avoid something that can reasonably be considered a bug. But sure, given that this bug is nowhere near being fixed, we can call that workaround a use case.

Attaching all entities to the session before you start doing anything would indeed be a great way to avoid this problem altogether, and more.

There’s been a lot of performance improvements in proxies in the past few years, so we can hope that some of them make this approach realistic (since I suppose it wasn’t before for performance reasons?).

In any case, please let me know if I can help. I could provide branches of Hibernate Search for you to build locally and experiment with, if you need changes in Hibernate Search.

Thank you! Should I try creating a JIRA issue in Hibernate regarding the state that entities have after being returned from the entity manager methods, or would you like to enter it to make sure the observation is communicated in a way that would be understood by that team?

About null associations in the merge “input” entity still being null in the “output” entity? I think it’s better if you create it yourself, as you’ll be able to answer their questions more easily than I will. Thanks!

Though the ORM team is quite busy with ORM 6 at the moment, so don’t expect an immediate response. But as we already discussed, this is a longer-term goal anyway :slight_smile:

Hello,

Have you had any luck with this strategy?

We have made some progress.

As long as every entity instance is instantiated using the following process, Hibernate Search is able to correctly traverse the hierarchical tree of IndexedEmbedded associations during indexing caused by a flush():

Serializable id = 1L;
// Unwrap the underlying Hibernate session from the JPA EntityManager.
SharedSessionContractImplementor session = entityManager.unwrap(SharedSessionContractImplementor.class);
MetamodelImplementor entityMetamodel = session.getFactory().getMetamodel();
EntityPersister persister = entityMetamodel.entityPersister(Entity.class.getName());
// Instantiate the entity through its persister instead of calling its constructor directly.
Entity instance = (Entity) persister.instantiate(id, session);

This method works successfully with persist() of a new entity, and merge() of a detached entity in our simple test cases.
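As a sketch of how we would generalize this (the helper name, Item.setDescription and the usage are hypothetical, not an existing API):

    import java.io.Serializable;
    import javax.persistence.EntityManager;
    import org.hibernate.engine.spi.SharedSessionContractImplementor;
    import org.hibernate.persister.entity.EntityPersister;

    // Hypothetical helper: instantiate an entity through its persister so that, once
    // populated from the client payload and merged, Hibernate Search can traverse its
    // associations during the indexing triggered by flush().
    @SuppressWarnings("unchecked")
    static <T> T instantiateViaPersister(EntityManager entityManager, Class<T> entityType, Serializable id) {
        SharedSessionContractImplementor session =
                entityManager.unwrap(SharedSessionContractImplementor.class);
        EntityPersister persister =
                session.getFactory().getMetamodel().entityPersister(entityType.getName());
        return (T) persister.instantiate(id, session);
    }

    // Example usage while deserializing an incoming update from the client:
    Item incoming = instantiateViaPersister(entityManager, Item.class, 1L);
    incoming.setDescription("Electric Motor (rev B)"); // copy the client's changes onto the instance
    Item managed = entityManager.merge(incoming);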

We have also experimented with turning on biDirectionalAssociationManagement, and this also causes indexing to happen correctly after flush in our simple test cases, but we have reservations about this feature. We anticipate that turning on biDirectionalAssociationManagement would trigger lazy loading of too many huge collections on the other side of every association we touch, even when they are not indexed. We still need to test this theory to know for sure.

The first solution would surely not exhibit this undesirable side-effect, because no lazy loading would occur unless that particular relationship is traversed. This would exhibit the same performance profile as our application used to have with Hibernate Search 5.

Since the routine above requires the session to be open, and our deserialization code runs outside of the session in a third-party communication layer, our next challenge is to see if we can make the entity manager available a lot earlier in the lifecycle of a remote function call from the client to the server.