Hibernate Search 6 TokenFilterDef: could I use `token_chars` of NGramFilterFactory to define an ngram?

Question

I looked through NGramFilterFactory.java for a way to set the letter, digit, whitespace, and symbol properties of token_chars.
But NGramFilterFactory only offers two properties: minGramSize and maxGramSize.
The latest NGramFilterFactory in lucene-analyzers-common 8.2.0 does not support this configuration either.
I need to set up token_chars, because our project’s data is complex.

Could you please help with this issue?

Environment

  1. @AnalyzerDef
@Indexed(index = "####")
@AnalyzerDef(name = "ngram",
        tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
        filters = {
                @TokenFilterDef(factory = StandardFilterFactory.class),
                @TokenFilterDef(factory = LowerCaseFilterFactory.class),
                @TokenFilterDef(factory = NGramFilterFactory.class,
                        params = {
                                @Parameter(name = "minGramSize", value = "1"),
                                @Parameter(name = "maxGramSize", value = "300")
                        })
        })
  2. mvn dependency:tree
[INFO] +- org.hibernate:hibernate-search-orm:jar:5.11.3.Final:compile
[INFO] |  \- org.hibernate:hibernate-search-engine:jar:5.11.3.Final:compile
[INFO] |     +- org.apache.lucene:lucene-core:jar:5.5.5:compile
[INFO] |     +- org.apache.lucene:lucene-misc:jar:5.5.5:compile
**[INFO] |     +- org.apache.lucene:lucene-analyzers-common:jar:5.5.5:compile**
[INFO] |     +- org.apache.lucene:lucene-facet:jar:5.5.5:compile
[INFO] |     |  \- org.apache.lucene:lucene-queries:jar:5.5.5:compile
[INFO] |     \- org.apache.lucene:lucene-queryparser:jar:5.5.5:compile
[INFO] +- org.hibernate:hibernate-search-elasticsearch:jar:5.11.3.Final:compile
[INFO] |  +- org.elasticsearch.client:elasticsearch-rest-client:jar:6.4.3:compile
[INFO] |  |  +- org.apache.httpcomponents:httpasyncclient:jar:4.1.4:compile
[INFO] |  |  \- org.apache.httpcomponents:httpcore-nio:jar:4.4.11:compile
[INFO] |  \- org.elasticsearch.client:elasticsearch-rest-client-sniffer:jar:5.6.8:compile
  3. pom.xml
    <!-- hibernate -->
    <dependency>
      <groupId>org.springframework.boot</groupId>
      <artifactId>spring-boot-starter-data-jpa</artifactId>
      <!--<version>2.1.8.RELEASE</version>-->
    </dependency>
    <dependency>
      <groupId>org.hibernate</groupId>
      <artifactId>hibernate-core</artifactId>
      <version>5.4.1.Final</version>
    </dependency>
    <dependency>
      <groupId>org.hibernate</groupId>
      <artifactId>hibernate-search-orm</artifactId>
      <version>5.11.3.Final</version>
    </dependency>
    <dependency>
      <groupId>org.hibernate</groupId>
      <artifactId>hibernate-search-elasticsearch</artifactId>
      <version>5.11.3.Final</version>
    </dependency>
  4. Elasticsearch version: 6.3.2 (NGram Tokenizer)
  5. The latest version also does not support this configuration.

Note

I’ve implemented indexing for a large dataset of about 4,000,000 records. Is there a better approach you would recommend?

    @Transactional
    public void buildLargeSearchIndex() {
        int offset = 0;
        int batchSize = 1000;
        boolean indexComplete = false;
        while (!indexComplete) {
            FullTextEntityManager fullTextEntityManager =
                    org.hibernate.search.jpa.Search.getFullTextEntityManager(entityManager);
            TypedQuery<####> query = fullTextEntityManager
                    .createQuery("SELECT u FROM #### u", ####.class);
            query.setFirstResult(offset);
            query.setMaxResults(batchSize);

            log.info("Indexing {}, offset {}", batchSize, offset);
            List<####> results = query.getResultList();
            if (results == null || results.isEmpty()) {
                indexComplete = true;
            } else {
                offset += results.size();
                for (#### user : results) {
                    fullTextEntityManager.index(user);
                }
            }
        }
        log.info("Indexed {} objects", offset);
    }

I’m sorry, what did you find? Could you point us to the documentation that led you to try to set these properties for the ngram filter factory? As far as I know they are supported in an entirely different filter factory, so I’m not sure what you’re trying to do.

Your code is not ideal because:

  1. It runs everything in a single transaction, which depending on the isolation level may lead to very long locks on your database.
  2. It runs everything in a single entity manager without clearing it, which will lead to (very) high memory usage.
  3. It relies on pagination, which depending on the isolation level of your transaction may lead to some entities not being indexed (because the list of results to your query changed between two calls to getResultList()).

If you’re not familiar with the kind of problems that batch processes involve, I’d recommend using the mass indexer.
If the mass indexer is not what you’re looking for, see the section dedicated to reindexing manually in the documentation.
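
For reference, the mass indexer call looks roughly like the sketch below with the Hibernate Search 5.11 API you are already using; the entity class and tuning values are placeholders (your entity is masked as #### in this thread), so adjust them to your model:

    // Mass indexer sketch for Hibernate Search 5.x with JPA.
    // Do NOT wrap this call in an application transaction: the mass indexer
    // manages its own sessions and transactions internally.
    import javax.persistence.EntityManager;
    import org.hibernate.search.jpa.FullTextEntityManager;
    import org.hibernate.search.jpa.Search;

    public class Reindexer {

        public void reindex(EntityManager entityManager) throws InterruptedException {
            FullTextEntityManager fullTextEntityManager = Search.getFullTextEntityManager(entityManager);
            fullTextEntityManager.createIndexer(MyEntity.class) // MyEntity is a placeholder for your #### entity
                    .batchSizeToLoadObjects(25)  // entities loaded per batch (illustrative value)
                    .threadsToLoadObjects(4)     // parallel loading threads (illustrative value)
                    .idFetchSize(150)            // JDBC fetch size for the identifier scroll (illustrative value)
                    .startAndWait();             // blocks until the whole index is rebuilt
        }
    }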


In the sample code below, I thought it would be possible to add @Parameter(name = "tokenChars", value = "letter,digit...").
But NGramFilterFactory only offers two properties: minGramSize and maxGramSize.
Is it possible to add `tokenChars` to NGramFilterFactory?

  1. @AnalyzerDef of ngram sample

Thank you.
Do you recommend using sample code, basically Spring Boot, for the mass indexer? ^^

Okay, but why did you think that? The “tokenChars” parameter is not mentioned in the example, and the ngram filter is not a tokenizer. What are you trying to do? More importantly, where did you get this idea?

Are you trying to use Elasticsearch-specific configuration? Like the “token_chars” parameter of the Elasticsearch NGramTokenizer? The ngram tokenizer and the ngram filter are not the same thing.
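
If Elasticsearch-specific configuration is indeed what you need: with the Elasticsearch integration you can register Elasticsearch-native analyzer definitions through an ElasticsearchAnalysisDefinitionProvider, and there “token_chars” is a parameter of the ngram tokenizer, not of the ngram filter. Below is a rough sketch only; double-check the exact builder methods and parameter overloads against the hibernate-search-elasticsearch 5.11 documentation, and note that the names are just examples:

    // Sketch only: declares an Elasticsearch-native ngram *tokenizer*, where
    // "token_chars" is a valid parameter (it is not a parameter of the ngram filter).
    import org.hibernate.search.elasticsearch.analyzer.definition.ElasticsearchAnalysisDefinitionProvider;
    import org.hibernate.search.elasticsearch.analyzer.definition.ElasticsearchAnalysisDefinitionRegistryBuilder;

    public class MyAnalysisDefinitionProvider implements ElasticsearchAnalysisDefinitionProvider {

        @Override
        public void register(ElasticsearchAnalysisDefinitionRegistryBuilder builder) {
            builder.analyzer("my_ngram_analyzer")        // referenced from the mapping by this name
                    .withTokenizer("my_ngram_tokenizer")
                    .withTokenFilters("lowercase");

            builder.tokenizer("my_ngram_tokenizer")
                    .type("ngram")
                    .param("min_gram", "2")
                    .param("max_gram", "3")
                    .param("token_chars", "letter", "digit"); // Elasticsearch-specific setting
        }
    }

The provider is then registered through the hibernate.search.elasticsearch.analysis_definition_provider configuration property.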

I do not understand what you mean by " sample code basically spring boot ". But I recommend using the mass indexer, yes.

This project works with Elasticsearch.
I need something like a SQL LIKE query (e.g. '%NIKE 1234%'). My documents are irregular words containing letters, numbers, and Chinese/Japanese characters. That is why I was hoping to set the token_chars of the ngram filter when the index is created in Elasticsearch.

Can you recommend the best search logic to get something like a DBMS LIKE query?

I have implemented the test code, and the issue is resolved.
It is amazing how quickly the index is created.

Please keep helping us ^^. I’m struggling with various problems…

Two things:

  1. There is no direct equivalent to LIKE in Lucene/Elasticsearch, or at least not one that performs well enough to be worth considering. Lucene/Elasticsearch are about full-text search, so they take a different, more sensible approach: you don’t match “substrings”, you match “tokens” (words); see the query sketch just after this list. There’s a decent introduction to full-text search in Hibernate Search 6’s documentation: Hibernate Search 6.0.11.Final: Reference Documentation. Hibernate Search 6 is still in beta and its APIs are vastly different, but the concepts are the same.
  2. A single filter will not address all your problems. The idea is to find the right tokenizer/filter for each problem, and to combine them.
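
To illustrate point 1, here is roughly what a token-based query looks like with the Hibernate Search 5 query DSL you are already using; the entity and field names are placeholders, since your real ones are masked as #### in this thread:

    // Query sketch: the field's analyzer turns "NIKE 1234" into tokens
    // ("nike", "1234") and matches them against the indexed tokens,
    // rather than scanning for a substring the way SQL LIKE does.
    // Imports: org.hibernate.search.jpa.FullTextEntityManager,
    //          org.hibernate.search.query.dsl.QueryBuilder, java.util.List.
    FullTextEntityManager ftem = org.hibernate.search.jpa.Search.getFullTextEntityManager(entityManager);

    QueryBuilder qb = ftem.getSearchFactory()
            .buildQueryBuilder()
            .forEntity(Product.class) // placeholder entity
            .get();

    org.apache.lucene.search.Query luceneQuery = qb.keyword()
            .onField("name")          // placeholder field
            .matching("NIKE 1234")
            .createQuery();

    List<Product> hits = ftem.createFullTextQuery(luceneQuery, Product.class)
            .setMaxResults(20)        // best matches first, thanks to relevance scoring
            .getResultList();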

So, about the filters (a combined analyzer sketch follows the list below)…

  • To make the query “nike” match the indexed text “nike 1234”, use a tokenizer. The whitespace tokenizer will tokenize on spaces only, the standard tokenizer will have more advanced behavior that tokenizes on punctuation, which may be what you’re looking for.
  • To make the query “Nike” match the indexed text “nike” (i.e. to get case insensitivity), use the lowercase filter.
  • To make the query “1234” match the indexed text “12345” (match the beginning of words), use an edge-ngram. Not ngram, edge-ngram. Check out the javadoc, they are different.
  • To make the query “1234” match “1243” or “4123”, use an ngram. It’s really about matching parts of tokens instead of the whole token, but it’s not exactly the same as LIKE in SQL. The query “nike 1234” with a min gram size of 2 and a max gram size of 3 will match “ni 123”, for example. You will have to rely on scoring (sort hits by relevance), which is enabled by default, to get the best matches first.
  • To handle Japanese/Chinese, I suppose you’ll have to rely on the ICU tokenizer and CJK filter, but frankly I don’t have a clue how these work (I’ve never had to work with Japanese/Chinese).
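
Putting those bullet points together, an analyzer definition might look roughly like the sketch below, using factories from the lucene-analyzers-common 5.5.5 in your dependency tree. The analyzer name, gram sizes and tokenizer choice are only illustrative; pick what matches your data:

    // Sketch: whitespace tokenizer + lowercase + edge-ngram for prefix matching.
    @AnalyzerDef(name = "name_prefix",
            // Splits on whitespace only; use StandardTokenizerFactory to also split on punctuation.
            tokenizer = @TokenizerDef(factory = WhitespaceTokenizerFactory.class),
            filters = {
                    // Case insensitivity: "Nike" matches "nike".
                    @TokenFilterDef(factory = LowerCaseFilterFactory.class),
                    // Prefix matching: the query "1234" matches the indexed token "12345".
                    // Edge-ngram keeps only grams anchored at the start of each token.
                    @TokenFilterDef(factory = EdgeNGramFilterFactory.class,
                            params = {
                                    @Parameter(name = "minGramSize", value = "1"),
                                    @Parameter(name = "maxGramSize", value = "10")
                            })
            })

Swapping EdgeNGramFilterFactory for NGramFilterFactory (with modest gram sizes) gives “inside the token” matching instead of prefix matching, at the cost of a much larger index.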