High performance autocomplete optimization

fivearise · June 25, 2024, 6:00pm

I have a database with roughly 20k entries. I want to add an autocomplete for one field only.

Here’s the rough version of the autocomplete right now:

entity.keyword().fuzzy().withEditDistanceUpTo(1).onField("title").matching(userInput)

Is there a better way to do this?

I want to optimize the autocomplete so that only the first two words in the field are returned. I know I could trim off any remaining words manually, but I’m hoping there’s a better built-in solution.

To elaborate, I want the autocomplete to match up to two words.

mbekhta · June 26, 2024, 6:53am

Usually, for autocomplete scenarios, edge ngram or ngram filters are applied at index time.
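To illustrate what an edge ngram filter produces, here is a minimal plain-Java sketch (no Lucene involved). The class and method names are made up for illustration; the sizes 2 and 15 are simply the values used in the analyzer example later in this thread. Each token is expanded into its prefixes, so a user who has typed only "hi" already hits an indexed term.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene or Hibernate Search.
final class EdgeNGramSketch {

    // Returns the prefixes of `token` whose length lies between minGram and maxGram (inclusive).
    // Tokens shorter than minGram produce no grams at all.
    static List<String> edgeNGrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int upper = Math.min(maxGram, token.length());
        for (int len = minGram; len <= upper; len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(edgeNGrams("high", 2, 15)); // [hi, hig, high]
    }
}
```

A plain ngram filter would additionally emit substrings from the middle of the word, which makes the index larger but allows matching inside words.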

As for returning only the first two words: I’m not sure I understand what you are trying to achieve, but if you want the search to be performed only on the first two words of the title (e.g., for the title “High performance autocomplete optimization”, only autocomplete on “High performance”), you can add a limit token count token filter. Just make sure that the limit filter is added before the ngram filter.
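Why the order matters can be shown with a small plain-Java simulation. This is only a sketch with made-up helper names, not the Lucene filters themselves: limiting first keeps two whole words and then expands them, while limiting last keeps only the first two grams of the first word.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical simulation of the two filters; the names are illustrative only.
final class FilterOrderSketch {

    // Keeps at most maxTokenCount tokens from the start of the stream.
    static List<String> limit(List<String> tokens, int maxTokenCount) {
        return new ArrayList<>(tokens.subList(0, Math.min(maxTokenCount, tokens.size())));
    }

    // Expands every token into its prefixes of length minGram..maxGram.
    static List<String> edgeNGrams(List<String> tokens, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (String token : tokens) {
            for (int len = minGram; len <= Math.min(maxGram, token.length()); len++) {
                grams.add(token.substring(0, len));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("high", "performance", "autocomplete", "optimization");

        // Limit first, then ngram: all prefixes of "high" and "performance".
        System.out.println(edgeNGrams(limit(words, 2), 2, 15));

        // Ngram first, then limit: only the first two grams survive.
        System.out.println(limit(edgeNGrams(words, 2, 15), 2)); // [hi, hig]
    }
}
```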

And just in case, here’s how to configure the analyzers with Hibernate Search:

fivearise · June 26, 2024, 3:30pm

That looks good! You know what I’m looking for.

Do you have the relevant documentation of ngram and the token limit for Lucene? The closest thing I could find is @KeywordField.

mbekhta · June 26, 2024, 4:10pm

Ohh ok, so for Lucene you’d want to implement LuceneAnalysisConfigurer; see this example to get started.

As for the filters in the Lucene case you just have to look through the Lucene packages, in your case these are the ones you were looking for:

org.apache.lucene.analysis.ngram.NGramFilterFactory : nGram
- parameters: minGramSize / maxGramSize
org.apache.lucene.analysis.ngram.EdgeNGramFilterFactory : edgeNGram
- parameters: minGramSize / maxGramSize
org.apache.lucene.analysis.miscellaneous.LimitTokenCountFilterFactory : limitTokenCount
- parameters: maxTokenCount / consumeAllTokens

and it’ll look something like the example below (hibernate.search.backend.analysis.configurer):

Create your configurer (don’t forget to pass it to Search, see the link above):

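import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.core.StopFilterFactory;
import org.apache.lucene.analysis.core.WhitespaceTokenizerFactory;
import org.apache.lucene.analysis.miscellaneous.LimitTokenCountFilterFactory;
import org.apache.lucene.analysis.ngram.EdgeNGramFilterFactory;
import org.hibernate.search.backend.lucene.analysis.LuceneAnalysisConfigurationContext;
import org.hibernate.search.backend.lucene.analysis.LuceneAnalysisConfigurer;

public class YourAnalysisConfigurer implements LuceneAnalysisConfigurer {
	@Override
	public void configure(LuceneAnalysisConfigurationContext context) {
		context.analyzer( "someAnalyzerName" ).custom()
				.tokenizer( WhitespaceTokenizerFactory.class )
				// add some filters to clean up your text;
				// lowercase first, since the stop filter is case-sensitive by default
				// and would otherwise miss capitalized stop words such as "The":
				.tokenFilter( LowerCaseFilterFactory.class )
				.tokenFilter( StopFilterFactory.class )
				// after these ^ filters you are expecting to have the words you want to auto-complete on
				// now add the limit filter to only keep two words
				.tokenFilter( LimitTokenCountFilterFactory.class )
				.param( "maxTokenCount", "2" )
				// add the ngram / or edgengram filter to generate the tokens
				.tokenFilter( EdgeNGramFilterFactory.class )
				.param( "minGramSize", "2" )
				.param( "maxGramSize", "15" );

		// same as above config but without the limit and ngram filters; use this one as the search analyzer
		context.analyzer( "someAnalyzerNameSearch" ).custom()
				.tokenizer( WhitespaceTokenizerFactory.class )
				// same cleanup filters, in the same order as at index time:
				.tokenFilter( LowerCaseFilterFactory.class )
				.tokenFilter( StopFilterFactory.class );
	}
}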
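As for passing the configurer to Hibernate Search: that is done through the hibernate.search.backend.analysis.configurer property mentioned above. Below is a minimal sketch of setting it programmatically. The package com.example is an assumption (use wherever your configurer actually lives), and how the properties reach Hibernate depends on your setup, e.g. persistence.xml, or application.properties with the spring.jpa.properties. prefix under Spring Boot.

```java
import java.util.Properties;

// Hypothetical holder class; only the property key and the "class:" value format matter here.
final class SearchSettingsSketch {

    static Properties searchProperties() {
        Properties props = new Properties();
        // The "class:" prefix references the configurer by its fully qualified class name.
        // com.example is a placeholder package, not something from this thread.
        props.setProperty(
                "hibernate.search.backend.analysis.configurer",
                "class:com.example.YourAnalysisConfigurer");
        return props;
    }

    public static void main(String[] args) {
        searchProperties().forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```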

Then apply these analyzers to your entity:

@Entity
@Indexed
public class MyEntity {

	// ... other things

	@FullTextField(analyzer = "someAnalyzerName", searchAnalyzer = "someAnalyzerNameSearch")
	private String myText;

}
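The reason for a separate searchAnalyzer is that the user's input should be compared as-is against the indexed prefixes rather than being split into prefixes itself. A rough plain-Java sketch of that idea follows. It is not how Lucene is implemented; stop words are skipped for brevity, and "any query word matches" is an assumption about the default behaviour of a match predicate.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical simulation of index-time vs. query-time analysis.
final class SearchAnalyzerSketch {

    // Index side: lowercase, keep the first two words, emit prefixes of length 2..15.
    static Set<String> indexTerms(String text) {
        Set<String> terms = new HashSet<>();
        String[] words = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        for (int i = 0; i < Math.min(2, words.length); i++) {
            for (int len = 2; len <= Math.min(15, words[i].length()); len++) {
                terms.add(words[i].substring(0, len));
            }
        }
        return terms;
    }

    // Query side: lowercase and split only. A hit means some query word equals an indexed prefix.
    static boolean matches(String title, String userInput) {
        Set<String> indexed = indexTerms(title);
        for (String word : userInput.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (indexed.contains(word)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String title = "High performance autocomplete optimization";
        System.out.println(matches(title, "Perf")); // true
        System.out.println(matches(title, "auto")); // false: third word is not indexed
    }
}
```

If the query were also run through the ngram filter, an input like "performer" would be reduced to "pe", "per", and so on, and would wrongly match "performance".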

fivearise · June 26, 2024, 4:20pm

This looks perfect! How would I perform the search? Is it as simple as entity.match().field("myText").matching(userInput)? (Sorry, this is all very new to me.)

fivearise · June 26, 2024, 5:53pm

OK, I got the autocomplete to work thanks to you, ty! I have another question though: is it possible to conditionally apply the analyzer on certain queries?

For instance, I want the autocomplete to use the analyzer, but I want the actual search to use the default FullTextField analyzer with fuzzy search. Should I create two columns with the same value, and apply the analyzer to one of them?

mbekhta · June 27, 2024, 7:12am

It is common to have multiple index fields with different analysis configuration derived from the same entity field. In other words, you can just add multiple full-text annotations to it like:

@Entity
@Indexed
public class MyEntity {

	// ... other things

	// Use this one for autocomplete queries
	// (make sure it has a name specified so that each field has a unique one).
	@FullTextField(name = "myTextAutocomplete", analyzer = "someAnalyzerName", searchAnalyzer = "someAnalyzerNameSearch")
	// Use this field for search queries where autocomplete is not needed.
	@FullTextField(analyzer = "someOtherAnalyzerNameOrNoneIfDefaultIsGood")
	private String myText;

}

horvoje · July 4, 2024, 11:00pm

What about autocomplete on multiple fields (product name, manufacturer name, provider name, provider city, provider country)?

Right now I have a String property “autocomplete” inside the product entity, and I populate it when the entity is persisted.
Later I use a MySQL stored procedure which uses the LIKE operator.

So if you have a product “Almond Milk” from manufacturer “Alpro” sold in a store “Aldi” in Killorglin, Ireland, you can find this product using the first letters of any of these values (“alp”, “ire” and so on).
When you type another word, the stored procedure adds an AND operator and another LIKE.

Is there a way to get the same behaviour with Lucene?

mbekhta · July 8, 2024, 2:58pm

Hey @horvoje

you could annotate the fields you are combining with @FullTextField and use an ngram filter in their analyzer, e.g. something like:

@FullTextField(name = "productAutocomplete", analyzer = "someAnalyzerName", searchAnalyzer = "someAnalyzerNameSearch")
String product;
@FullTextField(name = "manufacturerAutocomplete", analyzer = "someAnalyzerName", searchAnalyzer = "someAnalyzerNameSearch")
String manufacturer;
.... other fields

And then use a simple query string predicate targeting multiple fields (Hibernate Search 7.1.1.Final: Reference Documentation):

List<Product> hits = searchSession.search( Product.class )
        .where( f -> f.simpleQueryString()
                .field( "productAutocomplete" ).field( "manufacturerAutocomplete" )
                .matching( "alp" ) )
        .fetchHits( 20 );

for this part:

stored procedure adds AND operator and another LIKE

you’d just use the AND operator option on a simple query string predicate (Hibernate Search 7.2.0.Alpha2: Reference Documentation):

List<Product> hits = searchSession.search( Product.class )
        .where( f -> f.simpleQueryString()
                .field( "productAutocomplete" ).field( "manufacturerAutocomplete" )
                .matching( "alm alp" )
                .defaultOperator( BooleanOperator.AND ) )
        .fetchHits( 20 );
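To make the AND semantics concrete, here is a plain-Java sketch of the behaviour being reproduced. It only mimics the stored procedure's logic (every typed word must be a prefix of some word in some field) and makes no claim about how Lucene evaluates the query; the class and method names are invented, and the two-word limit and minimum gram size of the analyzer are ignored.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical simulation of "AND across fields" prefix matching.
final class AndAcrossFieldsSketch {

    // True when some word in some field value starts with `term`.
    static boolean anyFieldHasPrefix(List<String> fieldValues, String term) {
        for (String value : fieldValues) {
            for (String word : value.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (word.startsWith(term)) {
                    return true;
                }
            }
        }
        return false;
    }

    // AND semantics: every typed word has to be found in at least one field.
    static boolean matchesAll(List<String> fieldValues, String userInput) {
        for (String term : userInput.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!anyFieldHasPrefix(fieldValues, term)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> product = List.of("Almond Milk", "Alpro", "Aldi", "Killorglin", "Ireland");
        System.out.println(matchesAll(product, "alp ire"));   // true
        System.out.println(matchesAll(product, "alp tesco")); // false
    }
}
```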

dinaadams · July 10, 2024, 9:27am

Hello,
Thanks for sharing, this is very helpful for me.
