How to use Hibernate Search to replace SQL LIKE in a faster way?


#1

SQL LIKE slows my search down, so I want to replace it with Hibernate Search. My search text contains numbers, letters, and Chinese characters, so the only way I have found to fully replace SQL LIKE is a simple n-gram filter:
filter( NGramTokenizerFilterFactory.class ).param( "minGramSize", "1" ).param( "maxGramSize", "20" );
and I run the search with:

Query luceneQuery = mythQB
    .simpleQueryString()
    .onField("history")
    .withAndAsDefaultOperator()
    .matching(searchText)
    .createQuery();

But today I found that the longer the searchText is, the longer the search takes. What can I do to make the search faster? Or is there another way to do this search?


#2

Post your full analyzer definition so that we can understand what’s going on. Also please explain what you’re trying to achieve (give a few examples of a document and the search terms that should match).

From what you gave us, I would say you probably want to use a WhitespaceTokenizerFactory rather than an NGramTokenizerFilterFactory. Computing ngrams is pretty expensive.


#3

Thanks for your reply. I’m sorry I didn’t make myself clear. My explanation is as follows:


#4

I want to find a way to replace SQL searches such as column LIKE '%searchTerm%'.
My search fields are irregular, such as:


They contain numbers, letters, and Chinese characters. What I want is to find every item that contains my input. For example, when I input “a”, I should find “a”, “ab”, “ba”, “a11”, “11a”, and “汉字a”, i.e. all the items that contain “a”. I found that NGramFilterFactory doesn’t work with Chinese characters.


#5

So I use the following analyzer:


To match everything I want, I have to set maxGramSize to 20, because my search fields are shorter than 20 characters.
I found three ways to do my search:

1. Query query = qb.keyword().onField("remarks").ignoreAnalyzer().matching(searchText).createQuery();

2. Query query = qb.simpleQueryString()
       .onField("remarks")
       .withAndAsDefaultOperator()
       .matching(searchText)
       .createQuery();

3. Use filters:

   fullTextQuery = s.createFullTextQuery(query, Driver.class);
   fullTextQuery.enableFullTextFilter("remarks").setParameter(searchText);
   fullTextQuery.list(); //returns only best drivers where andre has credentials

But when searching 30 items, I found that when my input is longer than eight characters, the search takes longer than the SQL search. What can I do to make the search faster?


#7

I need “contains” predicates (with a wildcard at the beginning of the search term, such as column LIKE '%searchTerm%'). My documents are irregular strings containing letters, numbers, and Chinese characters. They are shorter than 20 characters. What can I do to make my search faster? Is there any other way to speed it up? So far I have found that using ignoreAnalyzer() is fastest. And in my case, will Hibernate Search be faster than searching with HQL?


#8

And in my case, will Hibernate Search be faster than searching with HQL?

It depends on many things, like the analyzers you use, the size of your data set, the RDBMS you use, hardware, contention, … I can’t possibly answer that kind of question, you just have to try it to find out. I can just say that Lucene is quite fast in general, so performance issues are likely to be configuration issues.

That being said, your use case is not exactly the standard full-text use case: you’re looking for random sequence of characters, not for actual words. Full-text search (Hibernate Search + Lucene) will work, but that’s not what it’s best at.

So, in your case, to make things faster you probably do need ngrams. I hope your dataset isn’t too extensive, or that you have a lot of disk space for your indexes, because ngrams take a lot of space. That’s the trade-off: it may be faster, but it will require a lot of disk space.
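
To get a rough feel for that trade-off, here is a small sketch that counts how many character n-grams a single field value produces for a given minGramSize/maxGramSize. It is plain Java, not Hibernate Search or Lucene code; the class and method names are made up for illustration, and the actual index size also depends on how many grams are shared between documents.

```java
/** Illustrative only: counts character n-grams; not part of Hibernate Search or Lucene. */
final class NGramCountSketch {

    /** Number of n-grams of size minGram..maxGram in a value of the given length. */
    static long countNGrams(int length, int minGram, int maxGram) {
        long total = 0;
        for (int size = minGram; size <= Math.min(maxGram, length); size++) {
            total += length - size + 1; // one gram per possible start position
        }
        return total;
    }

    public static void main(String[] args) {
        // A 20-character value with minGramSize=1 and maxGramSize=20
        System.out.println(countNGrams(20, 1, 20)); // 20 + 19 + ... + 1 = 210
        // An 8-character value with the same settings
        System.out.println(countNGrams(8, 1, 20));  // 8 + 7 + ... + 1 = 36
    }
}
```

So with these settings a single 20-character value can turn into up to 210 indexed terms, which is where the extra disk space goes.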

I’ve never used non-edge ngrams myself, but I could advise the following.

When querying, you should not apply the same n-gram tokenizer. If you do, then when searching for “ab” for example, Lucene would try to find any document that contains “a”, OR “b”, OR “ab”. That is three queries instead of the one you actually want, i.e. just the documents that contain “ab”.
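
To make this concrete, the sketch below splits a search term into character n-grams the way an n-gram analyzer with minGramSize=1 would, in plain Java. It only imitates the idea; it is not the Lucene tokenizer, the names are hypothetical, and Lucene may emit the grams in a different order.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a naive character n-gram splitter, not the Lucene implementation. */
final class QueryNGramSketch {

    /** Returns every substring of size minGram..maxGram (works on Java chars, so no surrogate-pair handling). */
    static List<String> nGrams(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            for (int size = minGram; size <= maxGram && start + size <= text.length(); size++) {
                grams.add(text.substring(start, start + size));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // The search term "ab" becomes three separate terms
        System.out.println(nGrams("ab", 1, 20)); // [a, ab, b]
    }
}
```

Each of those three terms becomes its own clause in the query, and the number of clauses grows quickly with the length of the search text.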

Thus, when querying, you’ll probably want to apply an analyzer that has the same filters, but no tokenizer at all (i.e. that uses the KeywordTokenizerFactory, which does not tokenize at all). This way, you’ll get the correct behavior: the search term will have to match against one ngram.

So:

  1. Define an additional analyzer named “customAnalyzer_query”, which is the same as “customAnalyzer”, but with a KeywordTokenizerFactory instead of NGramTokenizerFactory. Do not remove the current analyzer definition (“customAnalyzer”) and leave it on the remarks field. The new analyzer is an addition, not a replacement.
  2. When you create the query builder, make sure to override the analyzer:
    QueryBuilder qb = fullTextEntityManager.getSearchFactory().buildQueryBuilder().forEntity(Driver.class)
        .overridesForField( "remarks", "customAnalyzer_query" )
        .get();
    
  3. Build your query like this (do not ignore the analyzer):
    Query query = qb.keyword().onField("remarks").matching(searchText).createQuery();
    

The reason you should not ignore the analyzer is that if some filters are not applied, for example a lowercasing filter such as LowerCaseFilterFactory, then your search terms may unexpectedly not match the ngrams. For example, a document containing “SOMEWORD” will contain the ngram “someword” (lowercased), so a query with the search term “SOMEWORD” that ignores the analyzer will not match, since the search term is not lowercased.
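
The behavior described in this post can be imitated in plain Java, as a sketch only. It assumes the index-time analyzer produces lowercased n-grams (the real filters depend on your analyzer definition), and that the query-time analyzer keeps the term whole but still lowercases it. All names here are hypothetical; none of this is Hibernate Search API.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Illustrative only: imitates index-time n-grams and query-time keyword analysis. */
final class ContainsMatchSketch {

    /** Index side (assumed): lowercased n-grams of size minGram..maxGram. */
    static Set<String> indexTerms(String value, int minGram, int maxGram) {
        String lower = value.toLowerCase(Locale.ROOT);
        Set<String> terms = new HashSet<>();
        for (int start = 0; start < lower.length(); start++) {
            for (int size = minGram; size <= maxGram && start + size <= lower.length(); size++) {
                terms.add(lower.substring(start, start + size));
            }
        }
        return terms;
    }

    /** Query side with the query analyzer: one whole term, lowercased. */
    static boolean matchesAnalyzed(Set<String> terms, String searchText) {
        return terms.contains(searchText.toLowerCase(Locale.ROOT));
    }

    /** Query side ignoring the analyzer: the raw search text is looked up unchanged. */
    static boolean matchesRaw(Set<String> terms, String searchText) {
        return terms.contains(searchText);
    }

    public static void main(String[] args) {
        Set<String> terms = indexTerms("SOMEWORD", 1, 20);
        System.out.println(matchesAnalyzed(terms, "SOMEWORD")); // true
        System.out.println(matchesRaw(terms, "SOMEWORD"));      // false
        // Mixed Chinese and Latin text behaves the same way
        System.out.println(matchesAnalyzed(indexTerms("汉字a", 1, 20), "字A")); // true
    }
}
```

The query side is a single term lookup, whatever the length of the search text, which is why it should stay fast as the input grows.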


#9

Thanks for your answer.

I tried many ways of comparing search speed between HQL and Hibernate Search, and I found that with only 10 documents, searching with HQL is faster. But with a larger number of documents, such as 100 thousand, searching with Lucene is faster. So I concluded that, in general, Lucene has no advantage for the search above when there are only a few documents. Is my conclusion true in general?
I also have another question. Will Hibernate Search generally be faster than HQL for a search such as “FROM Book WHERE color = 'Blue'”? I ran tests to compare them but couldn’t find a pattern.
So, in general, will Lucene be faster than HQL in these two cases?
One is that I only have a small number of documents and use non-edge ngrams; the other is a search like “FROM Book WHERE color = 'Blue'”.

And to make my search faster, should I avoid using @IndexedEmbedded? Will using @IndexedEmbedded make searches take longer? So far I have found that @IndexedEmbedded has no effect on search time; it only affects index writes when an indexed document changes.

I can’t draw conclusions from my tests, so could you tell me the answer in general? I’m worried that I’m using Hibernate Search incorrectly, or maybe what I found is simply the truth.


#10

I tried many ways of comparing search speed between HQL and Hibernate Search, and I found that with only 10 documents, searching with HQL is faster. But with a larger number of documents, such as 100 thousand, searching with Lucene is faster. So I concluded that, in general, Lucene has no advantage for the search above when there are only a few documents. Is my conclusion true in general?

Yeah, it’s probably true. However, I fail to see how it’s relevant, since the case where you have 10 documents will be fast regardless of the technology. I don’t usually test that kind of setup in performance tests.

I can’t draw conclusions from my tests, so could you tell me the answer in general? I’m worried that I’m using Hibernate Search incorrectly, or maybe what I found is simply the truth.

I’m not sure what you’re asking from me. I already gave you the answer in general, which is “it depends”:

  • I told you that Lucene is fast. It’s a widely used piece of software and there’s no question it performs well.
  • I also told you I can’t answer as to whether it’s faster than your database, because first it would depend on your database (some are blazing fast, some are slow as hell), on your specific query and schema (performance can improve drastically when adding a column index in a database), on your hardware, and so on.

The question is: why does it matter?

  • If Hibernate Search solves problems you can’t solve with HQL, and it performs satisfactorily (queries taking less than some arbitrary amount of time you set, say 50ms), then why would you care if it’s faster or slower than HQL? HQL can’t help you anyway.
  • If Hibernate Search does not solve problems you can’t solve with HQL, and HQL performs satisfactorily, why would you even consider Hibernate Search?
  • If your main problem is the performance of your HQL queries, not features, then why would you care whether Search is faster than HQL at all times? It just needs to be “fast enough” at all times. Taking a silly example: if Search takes 80ms to return results when there are few documents while HQL takes 30ms, but with more than 100,000 documents Search takes 100ms and HQL 500ms, you just don’t care which is faster. You care that Search performs satisfactorily overall and HQL does not, because 500ms is way too slow.