Hibernate Search 6 TokenFilterDef: could I use `token_chars` of NGramFilterFactory to define an ngram?

Two things:

  1. There is no direct equivalent to LIKE in Lucene/Elasticsearch, or at least not one that performs well enough to be considered. Lucene/Elasticsearch are about full-text search, so they take a different, more sensible approach: you don’t match “substrings”, you match “tokens” (words). There’s a decent introduction to full-text search in Hibernate Search 6’s documentation: Hibernate Search 6.0.11.Final: Reference Documentation. Hibernate Search 6 is still in beta and its APIs are vastly different from Hibernate Search 5’s, but the concepts are the same.
  2. A single filter will not address all your problems. The idea is to find the right tokenizer/filter for each problem, and to combine them.

So, about the filters…

  • To make the query “nike” match the indexed text “nike 1234”, use a tokenizer. The whitespace tokenizer tokenizes on spaces only, while the standard tokenizer has more advanced behavior and also tokenizes on punctuation, which may be what you’re looking for.
  • To make the query “Nike” match the indexed text “nike” (i.e. to get case insensitivity), use the lowercase filter.
  • To make the query “1234” match the indexed text “12345” (i.e. match the beginning of words), use an edge-ngram filter. Not ngram: edge-ngram. Check out the javadoc; they are different. (There’s a sketch combining these tokenizers and filters just after this list.)
  • To make the query “1234” match “1243” or “4123”, use an ngram. It’s really about matching parts of tokens instead of the whole token, but it’s not exactly the same as LIKE in SQL. The query “nike 1234” with a min gram size of 2 and a max gram size of 3 will match “ni 123”, for example. You will have to rely on scoring (sort hits by relevance), which is enabled by default, to get the best matches first.
  • To handle Japanese/Chinese, I suppose you’ll have to rely on the ICU tokenizer and CJK filter, but frankly I don’t have a clue how these work (I’ve never had to work with Japanese/Chinese).
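To give an idea of how to combine these, here is a minimal sketch of an analyzer definition for Hibernate Search 6’s Lucene backend. The analyzer names (`name_prefix`, `name_prefix_query`, `name_substring`) and the gram sizes are just examples, and the exact configurer API may differ slightly depending on which beta/final version you’re on:

```java
import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.core.WhitespaceTokenizerFactory;
import org.apache.lucene.analysis.ngram.EdgeNGramFilterFactory;
import org.apache.lucene.analysis.ngram.NGramFilterFactory;
import org.hibernate.search.backend.lucene.analysis.LuceneAnalysisConfigurationContext;
import org.hibernate.search.backend.lucene.analysis.LuceneAnalysisConfigurer;

public class MyAnalysisConfigurer implements LuceneAnalysisConfigurer {
    @Override
    public void configure(LuceneAnalysisConfigurationContext context) {
        // Index-time analyzer: split on whitespace, lowercase,
        // then index the prefixes (edge-ngrams) of each token.
        context.analyzer( "name_prefix" ).custom()
                .tokenizer( WhitespaceTokenizerFactory.class )
                .tokenFilter( LowerCaseFilterFactory.class )
                .tokenFilter( EdgeNGramFilterFactory.class )
                        .param( "minGramSize", "1" )
                        .param( "maxGramSize", "10" );

        // Query-time analyzer: same tokenizer and lowercasing, but no ngrams,
        // so the query "1234" is matched as-is against the indexed prefixes.
        context.analyzer( "name_prefix_query" ).custom()
                .tokenizer( WhitespaceTokenizerFactory.class )
                .tokenFilter( LowerCaseFilterFactory.class );

        // "Substring-ish" analyzer: ngrams of size 2 to 3 taken anywhere in each token.
        context.analyzer( "name_substring" ).custom()
                .tokenizer( WhitespaceTokenizerFactory.class )
                .tokenFilter( LowerCaseFilterFactory.class )
                .tokenFilter( NGramFilterFactory.class )
                        .param( "minGramSize", "2" )
                        .param( "maxGramSize", "3" );
    }
}
```

You then point Hibernate Search at the configurer through the backend’s analysis configurer property (`hibernate.search.backend.analysis.configurer` in 6.0.x; the property name differs in some betas).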
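And a sketch of how those analyzers could be used; `Product` and its `name` field are made up for the example, and `searchAnalyzer`, `where(...)` and `fetchHits(...)` are the 6.0.x forms, so older betas may use slightly different names. Analyzing queries with `name_prefix_query` instead of `name_prefix` keeps the query itself from being chopped into grams, so “nike” is matched against the indexed prefixes of each word:

```java
import javax.persistence.Entity;
import javax.persistence.Id;

import org.hibernate.search.mapper.pojo.mapping.definition.annotation.FullTextField;
import org.hibernate.search.mapper.pojo.mapping.definition.annotation.Indexed;

@Entity
@Indexed
public class Product {

    @Id
    private Long id;

    // Indexed with edge-ngrams, but queries are analyzed without them,
    // so the query tokens are matched against the indexed word prefixes.
    @FullTextField(analyzer = "name_prefix", searchAnalyzer = "name_prefix_query")
    private String name;

    // getters/setters omitted
}
```

Searching is then a plain match predicate; relevance sorting (the default) puts the best matches first:

```java
import java.util.List;

import javax.persistence.EntityManager;

import org.hibernate.search.mapper.orm.Search;
import org.hibernate.search.mapper.orm.session.SearchSession;

public class ProductSearch {

    // "nike 1234" is tokenized and lowercased by the search analyzer,
    // then each token is matched against the indexed grams.
    public List<Product> byName(EntityManager entityManager, String terms) {
        SearchSession searchSession = Search.session( entityManager );
        return searchSession.search( Product.class )
                .where( f -> f.match().field( "name" ).matching( terms ) )
                .fetchHits( 20 );
    }
}
```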