Backend agnostic analyzer in HS6

There is not.

One of the main objectives of HSearch 6 was to avoid leaking a Lucene dependency through core APIs, because a dependency on Lucene in the Elasticsearch backend causes many headaches for users of the high-level Elasticsearch REST client.

The way analyzers were defined in a “backend-agnostic” (but not quite) way in HSearch 5 required a dependency on Lucene, and thus it had to go.

If you’re not using the high-level Elasticsearch REST client, you may want to use that translation layer anyway. You can find it in the code of HSearch 5, and copy it to your own code (if LGPL-compatible) or to a separate open-source library; I’m sure there’s a way to make it work independently from Hibernate Search 5:


If you’re interested in contributing a less hackish solution directly into Hibernate Search 6, I suppose it would be possible, but not without challenges.

First, it would require defining standard names for each tokenizer/filter, as well as their parameters. Elasticsearch already does, and Lucene does too (see org.apache.lucene.analysis.custom.CustomAnalyzer.Builder#withTokenizer(java.lang.String, java.lang.String...)), but IIRC the names are not the same. Elasticsearch uses an underscore-separated syntax while Lucene uses (IIRC) camel-case. That kind of conversion could be automated, but I believe they use a few completely different names as well (in particular for parameters).
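To give an idea of what that name translation might look like, here is a minimal sketch. The class and method names are hypothetical and not part of Hibernate Search; the override entries reflect my understanding of the current Lucene and Elasticsearch names and should be checked against the versions you target. It converts camel-case names mechanically and falls back to a hand-maintained table for names that don't follow the pattern.

```java
import java.util.Map;

// Hypothetical helper, not part of Hibernate Search, Lucene, or Elasticsearch.
class AnalysisNameTranslator {

    // Names that a mechanical conversion gets wrong and that must be
    // maintained by hand (assumed values; verify against your versions).
    private static final Map<String, String> OVERRIDES = Map.of(
            "asciiFolding", "asciifolding",
            "edgeNGram", "edge_ngram"
    );

    // Translates a Lucene-style camel-case name to an Elasticsearch-style name.
    static String toElasticsearchName(String luceneName) {
        String override = OVERRIDES.get(luceneName);
        return override != null ? override : camelToSnake(luceneName);
    }

    // Inserts an underscore before every upper-case letter (except the first
    // character) and lower-cases the whole name.
    static String camelToSnake(String name) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (Character.isUpperCase(c) && i > 0) {
                sb.append('_');
            }
            sb.append(Character.toLowerCase(c));
        }
        return sb.toString();
    }
}
```

The override table is the part that cannot be automated: every entry has to be discovered and kept up to date manually, and a similar table would be needed for parameter names.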

Also, some parameter values would require some amount of backend-specific translation, such as when you pass a reference to a local file: Lucene can use it directly, but for Elasticsearch we need to parse it and transform it to whatever format Elasticsearch expects. In HSearch 5 we used to do that by relying on parsing methods from Lucene, but in HSearch 6 we don’t want a Lucene dependency in the Elasticsearch backend, so we will need a different solution. We would probably have to reimplement our own parser, which I’m definitely not keen on maintaining.
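As an illustration of the kind of parser I mean, here is a rough sketch for the simplest case, a stopword file. The class and method are hypothetical, and the file format (one word per line, blank lines ignored, lines starting with # treated as comments) is an assumption on my part that would need to be checked against what Lucene actually accepts. The resulting list could then be sent inline in the Elasticsearch filter definition instead of a file reference.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hibernate Search, Lucene, or Elasticsearch.
class WordListParser {

    // Parses the content of a word-list file into a list of words.
    // Assumed format: one word per line, surrounding whitespace trimmed,
    // blank lines skipped, lines starting with '#' treated as comments.
    static List<String> parse(String fileContent) {
        List<String> words = new ArrayList<>();
        for (String line : fileContent.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            words.add(trimmed);
        }
        return words;
    }
}
```

Other file-based parameters (synonyms, for example) use richer formats, so each would need its own parser, which is exactly the maintenance burden I'd like to avoid.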

So, such a layer would still require some amount of translation from Hibernate Search, some of which cannot be automated, and thus must be manually maintained and kept in sync with changes in Lucene and, more importantly, Elasticsearch. This translation layer will probably always be somewhat late and lack some of the latest or more exotic tokenizers/filters or their parameters.

But I suppose, if it’s just for tests, some imperfections can be acceptable. Worst case, we can merge your patch, initially marking the API as incubating, and see how it goes.