Clarification/Confirmation of Shared Lucene index folder


#1

Just wanted to run something by the experts:

I’m using Hibernate 5.2.12.Final, Hibernate Search 5.8.2.Final, Lucene 5.5.5.

I built a small app that uses FullTextSession with a scrollable result set to scan the relevant tables and build the initial indexes; all is well.

But based on the diagram in the Hibernate Search documentation (“Section 2.2.1. Lucene”), it appears that in order to have the indexes “shared” between JVMs, I’ll have to have the same volume mounted on each node of my clustered servers. Assuming this is possible, is this correct, and is this what is meant by the same folder being “shared” between JVMs?

Thanks!


#2

The strategy you are referring to only works if you use a Lucene DirectoryProvider that manages sharing and locking internally. The default DirectoryProvider (FileSystem) does not.

I think this part of the documentation is mainly referring to the Infinispan Directory provider, which does implement some locking, but only in a non-optimal way meant as a “last line of defense”: http://infinispan.org/docs/stable/user_guide/user_guide.html#architectural_limitations . See also this part about Hibernate Search specifically: http://infinispan.org/docs/stable/user_guide/user_guide.html#architecture_considerations

From what I understand, even with the Infinispan DirectoryProvider you still need to coordinate your writes in some way to get reasonable write performance, and this is generally done with the JMS or JGroups backend in Hibernate Search.

I will let Sanne expand on this if he thinks it necessary, because he knows much more about the clustering part of Hibernate Search than I do…


#3

The second link is the same as the first?


#4

The two links do indeed point to the same document, but Yoann is pointing to two different sections using anchors.
In other words, see “21.4.4. Architecture considerations” and “21.3.8. Architectural limitations”.

In short, your interpretation is partially correct. You will need to enable some shared filesystem across the JVMs, but then you need to configure Hibernate Search according to those chapters to make sure it uses such shared mount points not for the “live” index but only to copy to/from it at periodic intervals. It is not possible to share the index directly on a shared mount because of locking issues.
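As a rough sketch of the periodic copy described above, assuming the filesystem-master and filesystem-slave directory providers from the Hibernate Search 5.x reference documentation. All paths are hypothetical placeholders, and the refresh period is only an example value. On the master node:

```properties
# Master: the only node that writes to the "live" index
hibernate.search.default.directory_provider = filesystem-master
# Local "live" index on the master (hypothetical path)
hibernate.search.default.indexBase = /var/lucene/indexes
# Shared mount the master periodically copies its index to (hypothetical path)
hibernate.search.default.sourceBase = /mnt/shared/lucene/mastercopy
# Copy period in seconds
hibernate.search.default.refresh = 1800
```

And on each slave node:

```properties
# Slave: searches a local copy, never the shared mount directly
hibernate.search.default.directory_provider = filesystem-slave
# Shared mount the slave periodically copies the index from (hypothetical path)
hibernate.search.default.sourceBase = /mnt/shared/lucene/mastercopy
# Local copy used for queries on this node (hypothetical path)
hibernate.search.default.indexBase = /var/lucene/indexes
# Copy period in seconds
hibernate.search.default.refresh = 1800
```

With this layout the shared mount only ever holds a copy, so no JVM takes a Lucene lock on a directory that another JVM is using.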


#5

Ok, based on what I’ve read, in order to get this to work with the least latency and problems, we’ll need: Wildfly, hosting a JMS instance against a database, accepting updates from each node, but only the master (running on this same box) will update the index. Replication will be done by each client per configuration (Example 14. JMS Slave configuration).
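For reference, the backend part of such a slave configuration might look roughly like the following. This is a sketch based on the JMS slave example in the Hibernate Search 5.x documentation; the connection factory and queue names are placeholders that have to match whatever the JMS provider actually exposes, and these settings sit alongside the node's directory provider settings.

```properties
# Slave node: send index updates to the JMS queue instead of writing locally
hibernate.search.default.worker.backend = jms
# JNDI name of the JMS connection factory (placeholder)
hibernate.search.default.worker.jms.connection_factory = /ConnectionFactory
# JNDI name of the queue read by the master (placeholder)
hibernate.search.default.worker.jms.queue = queue/hibernatesearch

# Optional: do not make the application thread wait for the message to be sent
# hibernate.search.default.worker.execution = async
```

On a non-EE container such as Tomcat, the connection factory and the queue presumably still have to be resolvable through JNDI; how to set that up depends on the JMS provider.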

The only concern I have is that the nodes are non-EE (Tomcat). But they just have to implement code similar to that shown in the MDBSearchController example. As long as a similar class is available within the non-EE application, Hibernate Search will use it (per the configuration) to write the updates to the queue.

On the other end, do I assume correctly that a local app will be necessary to read the queue (per the configuration) and then have HS process the updates? This should be very small/simple; is there an example?


#6

“On the other end, do I assume correctly, that a local app will be necessary to read the queue (per the configuration) and then HS process the updates?”

If by “on the other end” you mean “on the master node”, then yes, you are correct.

“This should be very small/simple”

The master node doesn’t need to run your webapp, if that’s what you mean. Only Hibernate ORM + Hibernate Search + the “MDBSearchController” are needed.

“… is there an example?”

For the full setup, I don’t think so, at least not beyond the documentation. There are our integration tests, but they are more likely to be confusing than anything else, being full of code that is only necessary for testing.

For the JMS controller, you can have a look at the abstract base class in Hibernate Search. The easiest solution is to extend this class and plug it into your JMS runtime. How to do that should be explained in the documentation of your JMS runtime, I guess.

On a side note, be careful not to use dynamic sharding with the clustering setups; there are known issues.


#7

Thanks for the guidance! Doesn’t look too hard. And we shouldn’t need to shard anything at the outset. I did a test index build on my workstation against a copy of production and it was only 8.5 GB, which is very reasonable.