Hibernate Search 6 TokenFilterDef: could I use `token_chars` of NGramFilterFactory to define an ngram?

Question

I looked through NGramFilterFactory.java for a way to set the letter, digit, whitespace, and symbol properties of token_chars.
But NGramFilterFactory only offers two properties: minGramSize and maxGramSize.
The latest NGramFilterFactory in lucene-analyzers-common 8.2.0 does not support this configuration either.
I need to set up token_chars, because our project’s data is complex.

Could you please help with this issue?

Environment

  1. @AnalyzerDef
@Indexed(index = "####")
@AnalyzerDef(name = "ngram",
        tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
        filters = {
                @TokenFilterDef(factory = StandardFilterFactory.class),
                @TokenFilterDef(factory = LowerCaseFilterFactory.class),
                @TokenFilterDef(factory = NGramFilterFactory.class,
                        params = {
                                @Parameter(name = "minGramSize", value = "1"),
                                @Parameter(name = "maxGramSize", value = "300")
                        })
        })
  2. mvn dependency:tree
[INFO] +- org.hibernate:hibernate-search-orm:jar:5.11.3.Final:compile
[INFO] |  \- org.hibernate:hibernate-search-engine:jar:5.11.3.Final:compile
[INFO] |     +- org.apache.lucene:lucene-core:jar:5.5.5:compile
[INFO] |     +- org.apache.lucene:lucene-misc:jar:5.5.5:compile
**[INFO] |     +- org.apache.lucene:lucene-analyzers-common:jar:5.5.5:compile**
[INFO] |     +- org.apache.lucene:lucene-facet:jar:5.5.5:compile
[INFO] |     |  \- org.apache.lucene:lucene-queries:jar:5.5.5:compile
[INFO] |     \- org.apache.lucene:lucene-queryparser:jar:5.5.5:compile
[INFO] +- org.hibernate:hibernate-search-elasticsearch:jar:5.11.3.Final:compile
[INFO] |  +- org.elasticsearch.client:elasticsearch-rest-client:jar:6.4.3:compile
[INFO] |  |  +- org.apache.httpcomponents:httpasyncclient:jar:4.1.4:compile
[INFO] |  |  \- org.apache.httpcomponents:httpcore-nio:jar:4.4.11:compile
[INFO] |  \- org.elasticsearch.client:elasticsearch-rest-client-sniffer:jar:5.6.8:compile
  3. pom.xml
    <!-- hibernate -->
    <dependency>
      <groupId>org.springframework.boot</groupId>
      <artifactId>spring-boot-starter-data-jpa</artifactId>
      <!--<version>2.1.8.RELEASE</version>-->
    </dependency>
    <dependency>
      <groupId>org.hibernate</groupId>
      <artifactId>hibernate-core</artifactId>
      <version>5.4.1.Final</version>
    </dependency>
    <dependency>
      <groupId>org.hibernate</groupId>
      <artifactId>hibernate-search-orm</artifactId>
      <version>5.11.3.Final</version>
    </dependency>
    <dependency>
      <groupId>org.hibernate</groupId>
      <artifactId>hibernate-search-elasticsearch</artifactId>
      <version>5.11.3.Final</version>
    </dependency>
  4. Elasticsearch version: 6.3.2 (NGram Tokenizer)
  5. The latest version also does not support this configuration.

Note

I’ve implemented indexing for a large dataset of about 4,000,000 records. Is there a better approach you would recommend?

    @Transactional
    public void buildLargeSearchIndex() {
        int offset = 0;
        int batchSize = 1000;
        boolean indexComplete = false;
        while (!indexComplete) {
            FullTextEntityManager fullTextEntityManager =
                    org.hibernate.search.jpa.Search.getFullTextEntityManager(entityManager);
            TypedQuery<####> query = fullTextEntityManager
                    .createQuery("SELECT u FROM #### u", ####.class);
            query.setFirstResult(offset);
            query.setMaxResults(batchSize);

            log.info("Indexing {}, offset {}", batchSize, offset);
            List<####> results = query.getResultList();
            if (results == null || results.isEmpty()) {
                indexComplete = true;
            } else {
                offset += results.size();
                for (#### user : results) {
                    fullTextEntityManager.index(user);
                }
            }
        }
        log.info("Indexed {} objects", offset);
    }

I’m sorry, what did you find? Could you point us to the documentation that led you to try to set these properties for the ngram filter factory? As far as I know they are supported in an entirely different filter factory, so I’m not sure what you’re trying to do.

Your code is not ideal because:

  1. It runs everything in a single transaction, which depending on the isolation level may lead to very long locks on your database.
  2. It runs everything in a single entity manager without clearing it, which will lead to (very) high memory usage.
  3. It relies on pagination, which depending on the isolation level of your transaction may lead to some entities not being indexed (because the list of results to your query changed between two calls to getResultList()).

If you’re not familiar with the kind of problems that batch processes involve, I’d recommend using the mass indexer.
If the mass indexer is not what you’re looking for, see the section dedicated to reindexing manually in the documentation.
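
For reference, the mass indexer call looks roughly like the sketch below with the Hibernate Search 5.11 API you are already using; the entity class and tuning values are placeholders (your entity is masked as #### in this thread), so adjust them to your model:

    // Mass indexer sketch for Hibernate Search 5.x with JPA.
    // Do NOT wrap this call in an application transaction: the mass indexer
    // manages its own sessions and transactions internally.
    import javax.persistence.EntityManager;
    import org.hibernate.search.jpa.FullTextEntityManager;
    import org.hibernate.search.jpa.Search;

    public class Reindexer {

        public void reindex(EntityManager entityManager) throws InterruptedException {
            FullTextEntityManager fullTextEntityManager = Search.getFullTextEntityManager(entityManager);
            fullTextEntityManager.createIndexer(MyEntity.class) // MyEntity is a placeholder for your #### entity
                    .batchSizeToLoadObjects(25)  // entities loaded per batch (illustrative value)
                    .threadsToLoadObjects(4)     // parallel loading threads (illustrative value)
                    .idFetchSize(150)            // JDBC fetch size for the identifier scroll (illustrative value)
                    .startAndWait();             // blocks until the whole index is rebuilt
        }
    }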


In the sample code below, I thought it would be possible to add @Parameter(name = "tokenChars", value = "letter,digit...").
But NGramFilterFactory only offers two properties: minGramSize and maxGramSize.
Is it possible to add `tokenChars` to NGramFilterFactory?

  1. @AnalyzerDef of ngram sample

Thank you.
Do you recommend using sample code, basically Spring Boot, for the mass indexer? ^^

Okay, but why did you think that? The “tokenChars” parameter is not mentioned in the example, and the ngram filter is not a tokenizer. What are you trying to do? More importantly, where did you get this idea?

Are you trying to use Elasticsearch-specific configuration? Like the “token_chars” parameter of the Elasticsearch NGramTokenizer? The ngram tokenizer and the ngram filter are not the same thing.
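
If Elasticsearch-specific configuration is indeed what you need: with the Elasticsearch integration you can register Elasticsearch-native analyzer definitions through an ElasticsearchAnalysisDefinitionProvider, and there “token_chars” is a parameter of the ngram tokenizer, not of the ngram filter. Below is a rough sketch only; double-check the exact builder methods and parameter overloads against the hibernate-search-elasticsearch 5.11 documentation, and note that the names are just examples:

    // Sketch only: declares an Elasticsearch-native ngram *tokenizer*, where
    // "token_chars" is a valid parameter (it is not a parameter of the ngram filter).
    import org.hibernate.search.elasticsearch.analyzer.definition.ElasticsearchAnalysisDefinitionProvider;
    import org.hibernate.search.elasticsearch.analyzer.definition.ElasticsearchAnalysisDefinitionRegistryBuilder;

    public class MyAnalysisDefinitionProvider implements ElasticsearchAnalysisDefinitionProvider {

        @Override
        public void register(ElasticsearchAnalysisDefinitionRegistryBuilder builder) {
            builder.analyzer("my_ngram_analyzer")        // referenced from the mapping by this name
                    .withTokenizer("my_ngram_tokenizer")
                    .withTokenFilters("lowercase");

            builder.tokenizer("my_ngram_tokenizer")
                    .type("ngram")
                    .param("min_gram", "2")
                    .param("max_gram", "3")
                    .param("token_chars", "letter", "digit"); // Elasticsearch-specific setting
        }
    }

The provider is then registered through the hibernate.search.elasticsearch.analysis_definition_provider configuration property.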

I do not understand what you mean by " sample code basically spring boot ". But I recommend using the mass indexer, yes.

This project works with Elasticsearch.
I need something like a SQL LIKE query (e.g. '%NIKE 1234%'). My documents are irregular words containing letters, numbers, and Chinese/Japanese characters. That is why I was hoping to set the token_chars of the ngram filter when the index is created in Elasticsearch.

Can you recommend the best search logic to get something like a DBMS LIKE query?

I have implemented the test code, and the issue is resolved.
It is amazing how quickly the index is created.

Please keep helping us ^^. I’m struggling with various problems…

Two things:

  1. There is no direct equivalent to LIKE in Lucene/Elasticsearch, or at least not one that performs well enough to be worth considering. Lucene/Elasticsearch are about full-text search, so they take a different, more sensible approach: you don’t match “substrings”, you match “tokens” (words); see the query sketch just after this list. There’s a decent introduction to full-text search in Hibernate Search 6’s documentation: Hibernate Search 6.0.11.Final: Reference Documentation. Hibernate Search 6 is still in beta and its APIs are vastly different, but the concepts are the same.
  2. A single filter will not address all your problems. The idea is to find the right tokenizer/filter for each problem, and to combine them.
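
To illustrate point 1, here is roughly what a token-based query looks like with the Hibernate Search 5 query DSL you are already using; the entity and field names are placeholders, since your real ones are masked as #### in this thread:

    // Query sketch: the field's analyzer turns "NIKE 1234" into tokens
    // ("nike", "1234") and matches them against the indexed tokens,
    // rather than scanning for a substring the way SQL LIKE does.
    // Imports: org.hibernate.search.jpa.FullTextEntityManager,
    //          org.hibernate.search.query.dsl.QueryBuilder, java.util.List.
    FullTextEntityManager ftem = org.hibernate.search.jpa.Search.getFullTextEntityManager(entityManager);

    QueryBuilder qb = ftem.getSearchFactory()
            .buildQueryBuilder()
            .forEntity(Product.class) // placeholder entity
            .get();

    org.apache.lucene.search.Query luceneQuery = qb.keyword()
            .onField("name")          // placeholder field
            .matching("NIKE 1234")
            .createQuery();

    List<Product> hits = ftem.createFullTextQuery(luceneQuery, Product.class)
            .setMaxResults(20)        // best matches first, thanks to relevance scoring
            .getResultList();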

So, about the filters (a combined analyzer sketch follows the list below)…

  • To make the query “nike” match the indexed text “nike 1234”, use a tokenizer. The whitespace tokenizer will tokenize on spaces only, the standard tokenizer will have more advanced behavior that tokenizes on punctuation, which may be what you’re looking for.
  • To make the query “Nike” match the indexed text “nike” (i.e. to get case insensitivity), use the lowercase filter.
  • To make the query “1234” match the indexed text “12345” (match the beginning of words), use an edge-ngram. Not ngram, edge-ngram. Check out the javadoc, they are different.
  • To make the query “1234” match “1243” or “4123”, use an ngram. It’s really about matching parts of tokens instead of the whole token, but it’s not exactly the same as LIKE in SQL. The query “nike 1234” with a min gram size of 2 and a max gram size of 3 will match “ni 123”, for example. You will have to rely on scoring (sort hits by relevance), which is enabled by default, to get the best matches first.
  • To handle Japanese/Chinese, I suppose you’ll have to rely on the ICU tokenizer and CJK filter, but frankly I don’t have a clue how these work (I’ve never had to work with Japanese/Chinese).
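
Putting those bullet points together, an analyzer definition might look roughly like the sketch below, using factories from the lucene-analyzers-common 5.5.5 in your dependency tree. The analyzer name, gram sizes and tokenizer choice are only illustrative; pick what matches your data:

    // Sketch: whitespace tokenizer + lowercase + edge-ngram for prefix matching.
    @AnalyzerDef(name = "name_prefix",
            // Splits on whitespace only; use StandardTokenizerFactory to also split on punctuation.
            tokenizer = @TokenizerDef(factory = WhitespaceTokenizerFactory.class),
            filters = {
                    // Case insensitivity: "Nike" matches "nike".
                    @TokenFilterDef(factory = LowerCaseFilterFactory.class),
                    // Prefix matching: the query "1234" matches the indexed token "12345".
                    // Edge-ngram keeps only grams anchored at the start of each token.
                    @TokenFilterDef(factory = EdgeNGramFilterFactory.class,
                            params = {
                                    @Parameter(name = "minGramSize", value = "1"),
                                    @Parameter(name = "maxGramSize", value = "10")
                            })
            })

Swapping EdgeNGramFilterFactory for NGramFilterFactory (with modest gram sizes) gives “inside the token” matching instead of prefix matching, at the cost of a much larger index.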