Two things:

- There is no direct equivalent to `LIKE` in Lucene/Elasticsearch, or at least not one that performs well enough to be considered. Lucene/Elasticsearch are about full-text search, so they take a different, more sensible approach: you don't match "substrings", you match "tokens" (words). There's a decent introduction to full-text search in Hibernate Search 6's documentation: Hibernate Search 6.0.11.Final: Reference Documentation. Hibernate Search 6 is still in beta and its APIs are vastly different, but the concept is the same.
- A single filter will not address all your problems. The idea is to find the right tokenizer/filter for each problem, and to combine them.
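To make the token-vs-substring distinction concrete, here is a tiny illustrative sketch in Python (not Lucene code; the `tokenize` and `matches` helpers are hypothetical stand-ins for what an analyzer and a match query do):

```python
# Illustrative sketch only: full-text search matches tokens,
# not arbitrary substrings the way SQL LIKE does.

def tokenize(text):
    # Roughly what a whitespace tokenizer + lowercase filter would produce.
    return text.lower().split()

def matches(query, indexed_text):
    # A hit requires every query token to appear among the indexed tokens.
    indexed_tokens = set(tokenize(indexed_text))
    return all(token in indexed_tokens for token in tokenize(query))

print(matches("nike", "Nike 1234"))  # True: "nike" is an indexed token
print(matches("ike", "Nike 1234"))   # False: "ike" is a substring,
                                     # but not a token
```

That second case is exactly why `LIKE '%ike%'` habits don't transfer directly, and why the tokenizers and filters below exist.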
So, about the filter…
- To make the query "nike" match the indexed text "nike 1234", use a tokenizer. The `whitespace` tokenizer will tokenize on spaces only; the `standard` tokenizer has more advanced behavior and tokenizes on punctuation too, which may be what you're looking for.
- To make the query "Nike" match the indexed text "nike" (i.e. to get case insensitivity), use the `lowercase` filter.
- To make the query "1234" match the indexed text "12345" (match the beginning of words), use an `edge-ngram` filter. Not `ngram`, `edge-ngram`. Check out the javadoc: they are different.
- To make the query "1234" match "1243" or "4123", use an `ngram` filter. It's really about matching parts of tokens instead of the whole token, but it's not exactly the same as `LIKE` in SQL. The query "nike 1234" with a min gram size of 2 and a max gram size of 3 will match "ni 123", for example. You will have to rely on scoring (sorting hits by relevance), which is enabled by default, to get the best matches first.
- To handle Japanese/Chinese, I suppose you'll have to rely on the ICU tokenizer and the CJK filter, but frankly I don't have a clue how these work (I've never had to work with Japanese/Chinese).
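If the edge-ngram/ngram distinction is still fuzzy, here is a sketch of what an edge n-gram filter emits (illustrative Python, not the actual Lucene filter; parameter names mirror the filter's `min_gram`/`max_gram` settings):

```python
# Illustrative sketch: an edge n-gram filter indexes prefixes of each
# token, so a query for "1234" can match a document containing "12345".

def edge_ngrams(token, min_gram=1, max_gram=10):
    # Emit every prefix of the token between min_gram and max_gram long.
    return {token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)}

indexed = edge_ngrams("12345")  # {"1", "12", "123", "1234", "12345"}
print("1234" in indexed)        # True: "1234" is a prefix of "12345"
print("2345" in indexed)        # False: edge n-grams only cover the start
```

This is why edge n-grams give you "match the beginning of words" and nothing more.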
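And a matching sketch for plain (non-edge) n-grams, again illustrative Python rather than the real filter, with a min gram size of 2 and a max gram size of 3 as in the example above:

```python
# Illustrative sketch: an ngram filter indexes every substring of length
# min_gram..max_gram, so "1234" shares grams with "1243" and can match
# it partially; scoring then ranks the closest hits first.

def ngrams(token, min_gram=2, max_gram=3):
    return {token[i:i + n]
            for n in range(min_gram, max_gram + 1)
            for i in range(len(token) - n + 1)}

query = ngrams("1234")  # {"12", "23", "34", "123", "234"}
doc = ngrams("1243")    # {"12", "24", "43", "124", "243"}
print(query & doc)      # only "12" is shared: the doc can match,
                        # just with a lower score than an exact "1234"
```

This also shows why such a query can match far-looking text like "ni 123": any shared gram is enough for a hit, and relevance sorting is what keeps the good matches on top.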