Slop does not work for <any word>

I am trying to implement a phrase query. I want it to behave with some flexibility around search terms (like a Google search), i.e. single-term searching is fine, and an exact match is fine, but if someone makes a typo then I should still be able to find the same/similar data.

I discovered the slop factor. According to the Hibernate Search 7.1.0 documentation:

So quick fox with a slop of 1 can become quick <word> fox, where <word> can be any word.

It states that ANY word can be inserted, but my tests show this is not true: it has to be the exact word, or the search will not match.

E.g. If this is the phrase: “Give good news to those who patiently” with a slop factor of 2,

  1. The original phrase will find a record.
  2. If I remove the first word “Give”, and use: “good news to those who patiently”, it will find the same record.
  3. If I reverse Give, and good, like so: “good Give news to those who patiently”, it will find the same record.
  4. If I use number 3’s arrangement but change “Give” to “Live” i.e. ANY word, it fails. e.g. “good Live news to those who patiently”.
  5. If I use a slop of 3 and move “Give” two words to the right, it works, e.g. “good news Give to those who patiently” BUT NOT if I replace “Give” with “Live”.

The documentation is misleading: it only uses the exact word in a different order. How can I actually use ANY word? Again, I am trying to achieve a Google-like search, with words being inferred even if they are mistyped (any hints?). But I'm trying to understand slop for now.

With phrase slop, the transformation operations apply to the text in the indexed document, i.e. the indexed document can contain those “additional words” (<word>), or the words can be in a different order from the one in the search query. But all the words from the phrase string in the search query must be in the document. Taking your example: if you had a document “good Live news to those who patiently” and searched for good news or news good, that document would be found; but if you searched for "some good news", there would be no match, since "some" is not in that indexed document.

Then for typos in words you might want to take a look at the fuzzy parameter. And if there may be words in a search string that won’t be in the indexed documents – you may want to consider looking at a simple query string predicate with an OR as a default operator.
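For illustration, a minimal sketch of such a simple query string predicate with OR as the default operator (the field name "story", the entity, and the fetch size are hypothetical here, chosen to match the rest of this thread):

```java
// Hypothetical sketch: simple query string with OR as the default operator.
// A document matches if it contains ANY of the search terms, not all of them.
List<Story> hits = searchSession.search( Story.class )
        .where( f -> f.simpleQueryString()
                .field( "story" )
                .matching( searchString )
                .defaultOperator( BooleanOperator.OR ) )
        .fetchHits( 20 );
```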


I thought about your suggestion and I came up with a boolean query where I consider a phrase query AND a Simple Query String using And:

SearchResult<MyStoryProjection> hits = searchSession.search( Story.class )
        .select( MyStoryProjection.class )
        .where( f -> f.bool()
                .should( f.phrase().field( "story" )
                        .matching( searchParam ).slop( 3 )
                )
                .should( f.simpleQueryString()
                        .field( "story" )
                        // fuzzy
                        .matching( searchParam + "~3" )
                        .defaultOperator( BooleanOperator.AND ) ) )
        .fetch( offset, recordsPerPage );

Since I am looking to mimic a Google search, in the default case the simple query string is an AND, so at least it can find the same tokens in a document. However, this is not ideal. A phrase query is a better solution, because the user is searching by phrase, so it’s my first condition.

I’m trying to understand the results I’m getting:

  1. User input: “Peace Grows Stronger”, I get 5 records, only the first one coming from a phrase, the rest just have the words in different places.
  2. User input: “Peacee Grows Stronger”, I get 0 records. I added an extra “e” to Peace. BUT I added a fuzziness factor of ~3 to my matching search param as per the documentation. I did some debugging and noticed this: [+story:peac, +story:grow, +story:stronger~2] and so the first 2 search terms don’t benefit from fuzziness, only the last token. Is there a way to apply ~3 to the whole phrase?
  3. I also can’t search for Peace grew Stronger, i.e. “grew” is a form of grow, but how can I make Hibernate Search understand that? I am using a slop factor, and I tried with 3 and 5, but it doesn’t work. To me, this is weak. It should easily be able to locate the same records if the word “grow” changes to “grew”.

Any ideas how I can tweak this? Thanks.

You added the fuzziness factor to the third term, not to Peacee. The string you’re passing to Hibernate Search is Peacee Grows Stronger~3.

As expected, see above.

Not with simpleQueryString, no. At least no built-in way. I suppose you can always split the string and add ~3 everywhere.
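A minimal plain-Java sketch of that splitting (the helper name is made up; it simply appends the fuzziness suffix to each whitespace-separated token):

```java
// Hypothetical helper: append a fuzziness suffix (~n) to every token
// of a whitespace-separated search string.
public class FuzzyTokens {
    public static String withFuzzy(String phrase, int editDistance) {
        StringBuilder sb = new StringBuilder();
        for ( String token : phrase.trim().split( "\\s+" ) ) {
            if ( sb.length() > 0 ) {
                sb.append( ' ' );
            }
            sb.append( token ).append( '~' ).append( editDistance );
        }
        return sb.toString();
    }
}
```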

You could configure fuzziness through the DSL with the match predicate, but then you lose the ability to require an AND (at least until HSEARCH-917 gets addressed), so it’s probably not what you’re looking for.

What you’re after is called stemming. It’s configured as part of analysis, not just at search time.
I’m not sure the available stemmers handle grew/grow, but there’s a chance they do.

See Getting started with Hibernate Search in Hibernate ORM, it talks about Analysis in general and a bit about stemming.

Fuzziness is indeed an inferior solution if your goal is extensive resilience to user typos.

I’d personally recommend:

  1. Relying on a score sort to get the best results first – that’s the default.
  2. Using multiple predicates of decreasing specificity and boost to match all the documents you’re looking for, but assign a better score to documents that match more exactly – that’s what you started doing.
  3. As the last (least boosted) predicate, using a match predicate on a dedicated index field with an ngram filter. The idea of this filter is that it will e.g. turn peacee into pea, eac, ace, cee and peace into pea, eac, ace. Thus peacee will match peace, as will any other term with at least three characters in common with what’s indexed, covering for typos; the trick is that typos will lead to fewer tokens matching, thus a lower score, and so you’ll still get the best matches higher up in the list of results.
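As an illustration, an ngram analyzer along those lines could be defined in a Lucene analysis configurer roughly like this (the analyzer name "english_ngram" and the 3/3 gram sizes are assumptions, not something prescribed by Hibernate Search):

```java
import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.ngram.NGramFilterFactory;
import org.apache.lucene.analysis.standard.StandardTokenizerFactory;
import org.hibernate.search.backend.lucene.analysis.LuceneAnalysisConfigurer;
import org.hibernate.search.backend.lucene.analysis.model.dsl.LuceneAnalysisConfigurationContext;

public class NgramAnalysisConfigurer implements LuceneAnalysisConfigurer {
    @Override
    public void configure(LuceneAnalysisConfigurationContext context) {
        // With 3-grams: "peacee" -> pea, eac, ace, cee; "peace" -> pea, eac, ace
        context.analyzer( "english_ngram" ).custom()
                .tokenizer( StandardTokenizerFactory.class )
                .tokenFilter( LowerCaseFilterFactory.class )
                .tokenFilter( NGramFilterFactory.class )
                        .param( "minGramSize", "3" )
                        .param( "maxGramSize", "3" );
    }
}
```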

@yrodiere I’ve done some testing and I feel very unsure right now about how analogous Hibernate Search is to Google search.

So help me understand scoring and boosting better. For my phrase query, I boosted it like so:

.where( f -> f.bool()
        .should( f.phrase().field( "story" )
                .matching( searchParam ).slop( 3 ).boost( 2.0f )
        )

That didn’t change the results I was getting so I don’t see the benefit of boosting this right now. Mind you, I am experimenting with one field which is the main field people will be searching for.

What do you mean by a “dedicated index field”? I am searching for stories (non-fiction) that users submit. I only have one analyzer, hence (I think) one index field per property, like so:

	@Lob @Column(columnDefinition = "TEXT")
	@FullTextField(analyzer = "english", projectable = Projectable.NO, searchAnalyzer = "searchAnalyzer")
	private String story;

so how am I to establish an additional index field, if I am understanding correctly?

Moreover, I played with the ngram filter. Your example of “pea, eac, ace” uses, as I understand it, a minGramSize of 3? I tried it with different combinations, like 3/5 or 3/6, etc., and it didn’t work (this is for the phrase “peacee grows stronger”), BUT when I tried minGramSize: 4 and maxGramSize: 8, I got back exactly one result, for the phrase query.

Which is so confusing. When you type in the exact phrase “Peace Grows Stronger”, it doesn’t satisfy the phrase condition (which it should, because it’s an exact match); it goes to the next OR condition:

and returns results where the tokens "peace", "grows", "stronger" are found.

I also noticed that if I keep both the ngram filter ON and fuzziness (~1), I get only one result, and that depends on how I arrange the terms. If I turn off fuzziness and keep just the ngram ON, I get 4 results, which again is ANDing all the terms. Yet again, if I search for "peace grows stronger" (no typos), it doesn't hit the phrase query. How come?




.should( f.simpleQueryString()
        .field( "story" )
        // fuzzy
        .matching( returnTokensUpdatedWithFuzzy( searchParam, 1 ) )
        .defaultOperator( BooleanOperator.AND ) ) )
.sort( f -> f.field( "dateTime_sort" ).desc() )
.fetch( offset, recordsPerPage );

At this point I’m really confused. Even when I turn off the ngram filter now, it still doesn’t hit the phrase query for the exact phrase. Did the index get corrupted? I delete it each time I create a build.

Ideally what I need is a phrase query with some forgiveness (like fuzziness), basically just like a google search. And if that fails, then it can default to an AND query where at least all the terms will be in the search.

Lastly, instead of having a bool Should, maybe I should use a MUST so that even if the phrase query works, I can get additional terms from the AND condition in the next MUST.

Suggestions on what I can do here?

I mean this:

@Lob @Column(columnDefinition = "TEXT")
@FullTextField(analyzer = "english", searchAnalyzer = "searchAnalyzer")
@FullTextField(name = "story_ngram", // dedicated index field for ngrams
        analyzer = "english_ngram", searchAnalyzer = "english_ngram")
private String story;

Yes. min=3/max=3 in my example.

You need to use the ngram filter for both indexing (analyzer =...) and search (searchAnalyzer = ...). Given your snippet above I suspect that’s not what you did?

Then try to run analysis manually on the relevant pieces of text, in order not to be confused.
With Elasticsearch: Analyze API | Elasticsearch Guide [8.13] | Elastic .
With Lucene you’ll have to retrieve the analyzer and call it manually.
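With Lucene, a minimal token-dump sketch might look like this (StandardAnalyzer is just a stand-in for whatever analyzer you retrieve from your configuration; the field name "story" is illustrative):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerDebug {
    // Run the given analyzer over a piece of text and collect the tokens it produces.
    public static List<String> tokens(Analyzer analyzer, String text) throws IOException {
        List<String> result = new ArrayList<>();
        try ( TokenStream stream = analyzer.tokenStream( "story", text ) ) {
            CharTermAttribute term = stream.addAttribute( CharTermAttribute.class );
            stream.reset();
            while ( stream.incrementToken() ) {
                result.add( term.toString() );
            }
            stream.end();
        }
        return result;
    }
}
```

Passing your indexed text and your search strings through this lets you compare the actual tokens on both sides instead of guessing.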

FYI, in Hibernate Search 7.2 you’ll be able to do it directly using Hibernate Search: [HSEARCH-4963] - Hibernate JIRA

That might be because you ran the phrase query against a field that uses the ngram analyzer. I doubt that can work.
You must have two index fields, and use them more or less like this:

SearchResult<MyStoryProjection> hits = searchSession.search( Story.class )
        .select( MyStoryProjection.class )
        .where( f -> f.bool()
                .should( f.phrase().field( "story" )
                        .matching( searchParam )
                        .boost( 10.0f ) // requested words, same order => big boost
                )
                .should( f.phrase().field( "story" )
                        .matching( searchParam ).slop( 3 )
                        .boost( 5.0f ) // requested words, different order => smaller boost
                )
                .should( f.match().field( "story" )
                        .matching( searchParam )
                        .boost( 2.0f ) // requested words, not contiguous => even smaller boost
                )
                .should( f.match().field( "story_ngram" ) // different field with a different analyzer!
                        .matching( searchParam )
                        .boost( 0.5f ) // requested words with typos => smallest boost
                )
        )
        .fetch( offset, recordsPerPage );

That’s a “confusing” idea. I wouldn’t do that.

Most likely not; if it was, you’d be seeing exceptions everywhere.

It could possibly be out of sync, but if you recreate it every time you change your mapping, you should be fine.

Google spent millions (billions?) of dollars on their search engine.

You are not going to do the exact same thing.

You are not going to find a ready-to-use solution “like a google search”.

You can approach it, though, by tuning your indexing and queries. See above.

This was very useful @yrodiere. Why doesn’t the documentation contain more real-life examples like this?!

Based on your suggestion, I tried this:

SearchResult<MyStoryProjection> hits = searchSession.search( Story.class )
        .select( MyStoryProjection.class )
        .where( f -> f.bool()
                .should( f.phrase().field( "story" )
                        .matching( searchParam )
                        .boost( 10.0f ) // requested words, same order => big boost
                )
                .should( f.phrase().field( "title" )
                        .matching( searchParam )
                        .boost( 10.0f ) // requested words, same order => big boost
                )
                .should( f.phrase().field( "story" )
                        .matching( searchParam ).slop( 3 )
                        .boost( 5.0f ) // requested words, different order => smaller boost
                )
                .should( f.simpleQueryString()
                        .field( "story" )
                        .matching( searchParam )
                        .defaultOperator( BooleanOperator.AND ).boost( 2.0f )
                )
                .should( f.simpleQueryString()
                        .field( "story" )
                        .matching( returnTokensUpdatedWithFuzzy( searchParam, 2 ) )
                        //.matching( searchParam )
                        .defaultOperator( BooleanOperator.AND ).boost( 1.0f ).constantScore()
                )
                .should( f.simpleQueryString()
                        .field( "storyNgram" )
                        //.matching( returnTokensUpdatedWithFuzzy( searchParam, 2 ) )
                        .matching( searchParam )
                        .defaultOperator( BooleanOperator.AND ).boost( 0.5f )
                )
        )
        .fetch( offset, recordsPerPage );

And here are the index fields:

	@Lob @Column(columnDefinition = "TEXT")
	@FullTextField(analyzer = "english", projectable = Projectable.NO, searchAnalyzer = "english")
	@FullTextField(name = "storyNgram", analyzer = "nGramAnalyzer", projectable = Projectable.NO, searchAnalyzer = "nGramAnalyzer")
	private String story;

So to begin with, I followed your example and updated the Story field (above).

I tested this all out and made some observations:

  1. To get the search as close as possible to the search term(s), I want the words to be contiguous if possible, i.e. an OR approach does not make sense. It gives me far more search results, mostly with just a single term matching, which is not great. So instead of this:
.should( f.match().field( "story" )
        .matching( searchParam )
        .boost( 2.0f ) // requested words, not contiguous => even smaller boost
)

I went with the simpleQuery approach where I can force all terms to at least exist in the same document. This is based on your suggestion of scoring this type of predicate less.

.should( f.simpleQueryString()
        .field( "story" )
        .matching( searchParam )
        .defaultOperator( BooleanOperator.AND ).boost( 2.0f )
)
  2. Next I wanted to incorporate a fuzzy match in case of typos. This is different from an ngram prefix because, for ngrams, the user relies on a specific word stem being spelt correctly in order to trigger an ngram match. But naturally, with typos, an ngram prefix might never be typed. So I did this:
.should( f.simpleQueryString()
        .field( "story" )
        .matching( returnTokensUpdatedWithFuzzy( searchParam, 2 ) )
        //.matching( searchParam )
        .defaultOperator( BooleanOperator.AND ).boost( 1.0f ).constantScore()
)

I am manually breaking up the search phrase into tokens and then appending a tilde with the edit distance, like so:

peacee~2 grows~2 stronger~2

My first inquiry:

  1. If I had a phrase like “Peace in the UK”, it wouldn’t yield great results. E.g. UK~2 can be replaced by any 2-letter word, and hence we would get irrelevant results. I heard there’s an exact-prefix length concept. Could I programmatically force a check on the minimum number of characters per token in order to execute this predicate? E.g. each token in the phrase must be 5 characters.
  2. The other thing is, how would this or wouldn’t affect STOP words? Should a STOP filter (StopFilterFactory.class) be used? I am currently using one like so:
		context.analyzer( "searchAnalyzer" ).custom()
		.tokenizer( StandardTokenizerFactory.class )
		.tokenFilter( LowerCaseFilterFactory.class )
		.tokenFilter( SnowballPorterFilterFactory.class ).param( "language", "English" )
		.tokenFilter( ASCIIFoldingFilterFactory.class )
                .tokenFilter(StopFilterFactory.class)
		.charFilter(HTMLStripCharFilterFactory.class);

If I create a string like: “peace~2 in~2 the~2 uk~2” I am including stop words “in” and “the”. Will that mess up the analysis? I get a lot of irrelevant records back.

My second inquiry is for the next nGram predicate:

.should( f.simpleQueryString()
        .field( "storyNgram" )
        //.matching( returnTokensUpdatedWithFuzzy( searchParam, 2 ) )
        .matching( searchParam )
        .defaultOperator( BooleanOperator.AND ).boost( 0.5f )
)

Similar to the previous fuzzy approach, I want to make this a conditional predicate. I found in my testing that even though I have asked for this predicate to be executed as an AND, if search phrase terms are shorter than the minimum ngram length, they are omitted, and the search might end up looking like an OR. E.g. “Peace in the UK” ends up looking like storyNgram:peac

E.g. for minGramSize= 4, the following phrase: “peace in the uk” gives me 192 hits which is largely irrelevant. Almost 30% of my database is returned.

So can I skip this predicate altogether if all the storyNgrams are not at least minGramSize? If so, how can I enforce this?

Otherwise, it looks like there will be an extremely high amount of irrelevant results. I’m almost feeling like these last two predicates are erratic. They work at times, but usually in a more useless way.

.should( f.simpleQueryString()
        .field( "story" )
        .matching( returnTokensUpdatedWithFuzzy( searchParam, 2 ) )
        //.matching( searchParam )
        .defaultOperator( BooleanOperator.AND ).boost( 1.0f ).constantScore()
)

.should( f.simpleQueryString()
        .field( "storyNgram" )
        //.matching( returnTokensUpdatedWithFuzzy( searchParam, 2 ) )
        .matching( searchParam )
        .defaultOperator( BooleanOperator.AND ).boost( 0.5f )
)

Having a conditional clause around these would be great. Is something like this possible? If so, how would I check for the right conditions (minimum word length for fuzzy, minimum ngram word length for ngram search), ELSE skip?

E.g. something like this doable in should clauses?

if ( patternParts.length > 0 ) {
    and.add( f.wildcard()
            .field( "departmentCode" )
            .matching( patternParts[0] ) );
}
if ( patternParts.length > 1 ) {
    and.add( f.wildcard()
            .field( "collectionCode" )
            .matching( patternParts[1] ) );
}
if ( patternParts.length > 2 ) {
    and.add( f.wildcard()
            .field( "itemCode" )
            .matching( patternParts[2] ) );
}

Because it’s a reference documentation, not a tutorial or a howto.

Also, because nobody contributed such examples, and the team can’t spend their nights writing ever more documentation. I don’t know if you noticed, but the current documentation is already a 400-page PDF.

And finally, because this Discourse thread right here will serve as documentation for the next person with a similar problem.

This is the wrong approach if you want fuzzy matches. You won’t be able to use the AND operator and provide fuzziness in the same query.

Instead, you should either:

  • keep the predicate with the OR operator, and add another predicate with the AND operator, with a greater boost.
  • OR have two queries: a first query with the AND operator, and a second one with the OR operator. If the one with the AND operator returns hits, you just return that and don’t execute the second one. If the one with the AND operator doesn’t return any hits, you run the second one and return its hits, with some hint on the web interface that you didn’t run the exact same query. Like, you know. Google.

See above. If you weren’t using the AND operator, ngrams would work just fine.

I don’t know; I don’t use fuzzy(), because, as mentioned, it is too limited for my taste. Maybe someone else will chime in.

Regardless of whether you use fuzzy() or not, if you use the AND operator you should probably use a stop words filter, yes. Even with the OR operator, it’s generally a good default.

That’s for you to tune. You want fuzziness, and fuzziness means more not-exactly-matching results. How many you will get is up to you and your min/max gram size.

Use a token filter to remove tokens with fewer than 4 characters. There should be one; see the reference documentation for links to the available filters and their options.
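For example, Lucene's LengthFilterFactory can be added to a custom analyzer definition to drop short tokens (the min/max values below are illustrative, not recommendations):

```java
// Sketch: drop tokens shorter than 4 characters (and longer than 255)
// by adding Lucene's LengthFilterFactory to the analyzer definition.
context.analyzer( "searchAnalyzer" ).custom()
        .tokenizer( StandardTokenizerFactory.class )
        .tokenFilter( LowerCaseFilterFactory.class )
        .tokenFilter( LengthFilterFactory.class )
                .param( "min", "4" )
                .param( "max", "255" );
```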

minimumShouldMatch may do something similar to what you want (though not identical), but it’s only available in the simpleQueryString predicate; see the reference documentation.

I tried to find an IF/ELSE example but could not find one that pertained to my existing predicates. Are you saying I can add this within the predicates I have, or will this be separate? Can you please provide an example using the existing predicates?

I could not find an option on my token filter, SnowballPorterFilterFactory, to only process tokens of a minimum length (I’m using Lucene). I only saw something from Solr, e.g.

<filter class="solr.LengthFilterFactory" min="2" max="7"/>

But I’m not sure how to use this. Please provide an example.

@yrodiere @mbekhta anyone, wondering if I can get an update on my last query? Appreciate it.

IF/ELSE are Java constructs. I think you can find them :)

The reference documentation tells you how to use token filters:

https://docs.jboss.org/hibernate/stable/search/reference/en-US/html_single/#backend-lucene-analysis-analyzers

It also points you to the javadoc of Lucene filters:

https://lucene.apache.org/core/9_9_2/core/

https://lucene.apache.org/core/9_9_2/core/org/apache/lucene/analysis/package-summary.html

Though it seems the filter you need is in another javadoc: Overview (Lucene 9.10.0 common API), specifically here.

@mbekhta we probably need to add links to the javadoc of lucene-analysis-common in the reference documentation? Perhaps even direct links to the packages people should look into.

That’s not how it works. You show what you tried first.

Perhaps I can try explaining this a different way: I’m trying to understand where to add the IF/ELSE based on what you said, cross-referencing it with the reference documentation, because as I said before, I can’t find any examples. Specifically, you said:

i.e. I am not clear where to check whether hits are returned for a specific predicate, because everything (all the predicates) is computed in totality at one time. E.g. how would I know whether this isolated predicate, in a list of other predicates, returns hits, using IF or IF/ELSE?

.should( f.simpleQueryString()
        .field( "story" )
        .matching( returnTokensUpdatedWithFuzzy( searchParam, 2 ) )
        //.matching( searchParam )
        .defaultOperator( BooleanOperator.AND ).boost( 1.0f ).constantScore()
)

My objective at this point is to only run this if:

  1. All tokens have a minimum of X characters, where X is a customized minimum value (I will look into token filtering next); otherwise skip the following two predicates.

I essentially want to say: if (all search tokens have a minimum of X characters), then execute the following two blocks. I.e. I don't care about hits immediately, but that every predicate has a chance to run, respecting any conditions.
.should( f.simpleQueryString()
        .field( "story" )
        .matching( returnTokensUpdatedWithFuzzy( searchParam, 2 ) )
        .defaultOperator( BooleanOperator.AND ).boost( 1.0f ).constantScore() // fuzzy matching
)

.should( f.simpleQueryString()
        .field( "storyNgram" )
        .matching( searchParam )
        .defaultOperator( BooleanOperator.AND ).boost( 0.5f ) // ngram, minGramSize 4, maxGramSize 4
)

@mbekhta Update, I’ve added:

.tokenFilter(LengthFilterFactory.class).param("min", "4").param("max", "4");

on the fuzzy predicate. My observation is that the filtering was likely triggered, but because I constructed the fuzzy string manually for the string “Peace in the uk”, it came out like this:

+story:peace~2 +story:in~2 +story:the~2 +story:uk~2

Whereas the other predicates looks like this:

story:"peac ? ? uk"
title:"peac ? ? uk"
story:"peac ? ? uk"~3

I got way fewer results this time (9 vs. 149), so is it ignoring:

+story:peace~2 +story:in~2 +story:the~2 +story:uk~2

?

The challenge is that no stop words are removed in this case. So is the conclusion that if I cannot use fuzzy on ALL terms with a MIN length, I should skip? Because I noticed that if your MIN is greater than the shortest word, that word is removed, but then the query no longer stays authentic and you get an OR scenario, which I don’t want.

The same will go for the ngram scenario. My minNgram is 4, but it gives me back:

storyNgram:peac

for phrase: “peace in the uk”

This gives me over 100 results! Basically it ignores “uk”, which is a key piece of the query, and just searches for “peac”.

So would it make sense to first run an IF statement to check if ALL TERMS have a MIN length before executing the predicates? But again, if I do that, it will not account for STOP words.

Hey,

This is quite a lengthy discussion you have here, so I may be missing some of the context, but from what I see, the IF/ELSE bit comes from this original suggestion:

which is just suggesting (in pseudo-java):

var hits = searchSession.search( Story.class ).where( "first query using AND" )...;
if ( hits.isEmpty() ) {
    hits = searchSession.search( Story.class ).where( "second query using OR" )...;
    return markResultsNotForTheOriginalQuery( hits );
}
return hits;

I think @yrodiere suggested this earlier, and I just want to emphasise it: you should set up a test for your analyzer configurations, since what you are trying to do is not a trivial lowercase -> remove-stopwords scenario. Here’s an idea of how this can be done with Lucene while Hibernate Search 7.2 is not released: hibernate-search/backend/lucene/src/main/java/org/hibernate/search/backend/lucene/analysis/impl/LuceneAnalysisPerformer.java at d35863ffca40fa551df73df331025ebfd38f80a9 · hibernate/hibernate-search · GitHub

Set up the test, pass your text and search queries through the analyzer, and see what tokens are produced to better understand what’s happening with your configuration.

I was speaking in terms of multiple SHOULD clauses, on whether I can use IF/ELSE, see above. Thoughts?

I have also shared my answer; it’s just above yours. Thoughts? I.e. “Peace in the UK” with a fuzzy factor of 2 can completely alter “UK” into any two-letter word, which is undesirable. I think in situations where there are not many characters, skipping the entire SHOULD clause might be the way. If so, my original question was how I can incorporate an IF and a SHOULD together based on knowing the token length of each term (I can use a custom function to return the length of each token, but it won’t account for stop words). I need a simple stop-word and lowercase filter, which is out of the box; do I still need to set up the analyzer test you suggested?

If you want to better understand what you’ve configured with analyzers and what tokens are produced then go ahead and set up the test and see what you are getting; if not, don’t.

There’s no if/else inside should clauses. But then, the original suggestion was about running two different queries, with the second one executed only if the first one hasn’t produced any results.

Since you are already splitting the search string and adding a fuzziness factor to each word … just don’t add it to the words that are less than some-number characters long?
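That conditional appending could be sketched like this (the helper name and the threshold parameter are made up for illustration):

```java
// Hypothetical helper: append ~editDistance only to tokens that are long
// enough for fuzziness to make sense; leave short tokens (e.g. "UK") exact.
public class SelectiveFuzzy {
    public static String withSelectiveFuzzy(String phrase, int editDistance, int minLength) {
        StringBuilder sb = new StringBuilder();
        for ( String token : phrase.trim().split( "\\s+" ) ) {
            if ( sb.length() > 0 ) {
                sb.append( ' ' );
            }
            sb.append( token );
            if ( token.length() >= minLength ) {
                sb.append( '~' ).append( editDistance );
            }
        }
        return sb.toString();
    }
}
```

With a minimum length of 5, "Peace in the UK" would come out as "Peace~2 in the UK": short tokens stay exact, so fuzziness can no longer rewrite "UK" into an arbitrary two-letter word.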