Backend-agnostic analyzer in HS6

In HS5 we used to provide analyzer definitions within the entity class itself (via @AnalyzerDef annotations), and these worked with both the Lucene and ES backends. This was useful for us, as we have some instances running with Lucene (especially for testing) and some with ES.

With the move to HS6, analyzer definitions instead need to be separate classes implementing the relevant interface (LuceneAnalysisConfigurer or ElasticsearchAnalysisConfigurer). Is there any way to specify an analyzer that can be used by both backends? This would help us, as we wouldn’t have to duplicate the analyzer definitions and port changes from one to the other.
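For concreteness, this is roughly what the duplicated configurers look like for us (a minimal sketch assuming the 6.0 APIs; the analyzer name and class names are made up):

```java
import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilterFactory;
import org.apache.lucene.analysis.standard.StandardTokenizerFactory;
import org.hibernate.search.backend.lucene.analysis.LuceneAnalysisConfigurationContext;
import org.hibernate.search.backend.lucene.analysis.LuceneAnalysisConfigurer;

public class MyLuceneAnalysisConfigurer implements LuceneAnalysisConfigurer {
    @Override
    public void configure(LuceneAnalysisConfigurationContext context) {
        // Same logical analyzer as the Elasticsearch version below,
        // but expressed with Lucene factory classes.
        context.analyzer( "myAnalyzer" ).custom()
                .tokenizer( StandardTokenizerFactory.class )
                .tokenFilter( LowerCaseFilterFactory.class )
                .tokenFilter( ASCIIFoldingFilterFactory.class );
    }
}
```

```java
import org.hibernate.search.backend.elasticsearch.analysis.ElasticsearchAnalysisConfigurationContext;
import org.hibernate.search.backend.elasticsearch.analysis.ElasticsearchAnalysisConfigurer;

public class MyElasticsearchAnalysisConfigurer implements ElasticsearchAnalysisConfigurer {
    @Override
    public void configure(ElasticsearchAnalysisConfigurationContext context) {
        // The same analyzer again, this time with Elasticsearch's string-based names.
        context.analyzer( "myAnalyzer" ).custom()
                .tokenizer( "standard" )
                .tokenFilters( "lowercase", "asciifolding" );
    }
}
```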

There is not.

One of the main objectives of HSearch 6 was to avoid leaking a Lucene dependency through core APIs, because a dependency on Lucene in the Elasticsearch backend causes many headaches for users of the high-level Elasticsearch REST client.

The way analysis definitions were declared in a “backend-agnostic” (but not quite) way in HSearch 5 required a dependency on Lucene, so it had to go.

If you’re not using the high-level Elasticsearch REST client yourself, you may want to use that translation layer anyway. You can find it in the code of HSearch 5 and copy it into your own code (if your licensing is LGPL-compatible) or into a separate open-source library; I’m sure there’s a way to make it work independently of Hibernate Search 5.


If you’re interested in contributing a less hackish solution directly into Hibernate Search 6, I suppose it would be possible, but not without challenges.

First, it would require defining standard names for each tokenizer/filter, as well as for their parameters. Elasticsearch already defines such names, and so does Lucene (see org.apache.lucene.analysis.custom.CustomAnalyzer.Builder#withTokenizer(java.lang.String, java.lang.String...)), but IIRC the names are not the same: Elasticsearch uses lowercase names with underscores, while Lucene uses (IIRC) camelCase. That kind of conversion could be automated, but I believe a few names are completely different as well (in particular for parameters).
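Just to illustrate what I mean (a purely hypothetical sketch, nothing like this exists in Hibernate Search; the specific name mappings are from memory and would need checking): a mechanical case conversion gets you part of the way, but the outliers would need a hand-maintained table.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of a Lucene-SPI-name -> Elasticsearch-name translator.
// The override entries are illustrative; a real table would have to be curated
// by hand and kept in sync with both projects.
final class AnalysisNameTranslator {

    // Names that differ beyond a simple case conversion.
    private static final Map<String, String> OVERRIDES = Map.of(
            "asciiFolding", "asciifolding" // Lucene SPI name vs. Elasticsearch token filter name
    );

    static String toElasticsearchName(String luceneSpiName) {
        String override = OVERRIDES.get( luceneSpiName );
        if ( override != null ) {
            return override;
        }
        // Mechanical camelCase -> snake_case, e.g. "wordDelimiterGraph" -> "word_delimiter_graph".
        return luceneSpiName.replaceAll( "([a-z0-9])([A-Z])", "$1_$2" )
                .toLowerCase( Locale.ROOT );
    }
}
```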

Also, some parameter values would require some amount of backend-specific translation, such as when you pass a reference to a local file: Lucene can use it directly, but for Elasticsearch we need to parse it and transform it into whatever format Elasticsearch expects. In HSearch 5 we used to do that by relying on parsing methods from Lucene, but in HSearch 6 we don’t want a Lucene dependency in the Elasticsearch backend, so we would need a different solution: probably reimplementing our own parser, which I’m definitely not keen on maintaining.
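For example, for a stopwords file, a hypothetical translation (without relying on Lucene) could look like the sketch below: read the local file on our side and inline its contents into the Elasticsearch settings. It's deliberately naive, and that's the point: all the format handling Lucene currently does for us (comments, snowball format, multiple words per line, ...) would have to be reimplemented and maintained.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: turn a local stopwords file (which a Lucene filter could
// reference directly) into an inline word list suitable for an Elasticsearch
// "stop" filter's "stopwords" parameter.
final class StopwordsParamTranslator {

    static List<String> inlineStopwords(Path localStopwordsFile) throws IOException {
        // One word per line, skipping blanks and comment lines; real-world files
        // may use other formats that Lucene knows how to parse and this does not.
        return Files.readAllLines( localStopwordsFile ).stream()
                .map( String::trim )
                .filter( line -> !line.isEmpty() && !line.startsWith( "#" ) )
                .collect( Collectors.toList() );
    }
}
```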

So such a layer would still require some amount of translation code in Hibernate Search, some of which cannot be automated and must therefore be manually maintained and kept in sync with changes in Lucene and, more importantly, Elasticsearch. That translation layer would probably always lag behind somewhat and lack some of the latest or more exotic tokenizers/filters or their parameters.

But I suppose, if it’s just for tests, some imperfections can be acceptable. Worst case, we can merge your patch, initially marking the API as incubating, and see how it goes.

To be honest, for our use case it sounds like it will be less work to simply duplicate the classes and specify which one to use at runtime; I just thought it was worth checking.
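For reference, we’re thinking of something along these lines (just a sketch; the class names, persistence unit and system property are made up, and we’ll double-check the exact configuration property key against the docs):

```java
import java.util.HashMap;
import java.util.Map;

import javax.persistence.EntityManagerFactory;
import javax.persistence.Persistence;

public final class SearchBootstrap {

    public static EntityManagerFactory create() {
        boolean lucene = "lucene".equals( System.getProperty( "search.backend", "elasticsearch" ) );

        Map<String, Object> settings = new HashMap<>();
        // Point the default backend at the configurer matching the chosen backend.
        settings.put( "hibernate.search.backend.analysis.configurer",
                lucene ? "com.acme.search.MyLuceneAnalysisConfigurer"
                        : "com.acme.search.MyElasticsearchAnalysisConfigurer" );

        return Persistence.createEntityManagerFactory( "my-persistence-unit", settings );
    }
}
```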

Thanks once again for your quick and detailed response, greatly appreciated!