This was very useful @yrodiere. Why doesn’t the documentation contain more real-life examples like this?!
Based on your suggestion, I tried this:
SearchResult<MyStoryProjection> hits = searchSession.search( Story.class )
        .select( MyStoryProjection.class )
        .where( f -> f.bool()
                .should( f.phrase().field( "story" )
                        .matching( searchParam )
                        .boost( 10.0f ) // requested words, same order => big boost
                )
                .should( f.phrase().field( "title" )
                        .matching( searchParam )
                        .boost( 10.0f ) // requested words, same order => big boost
                )
                .should( f.phrase().field( "story" )
                        .matching( searchParam ).slop( 3 )
                        .boost( 5.0f ) // requested words, different order => smaller boost
                )
                .should( f.simpleQueryString()
                        .field( "story" )
                        .matching( searchParam )
                        .defaultOperator( BooleanOperator.AND ).boost( 2.0f ) // all requested words present, not necessarily contiguous
                )
                .should( f.simpleQueryString()
                        .field( "story" )
                        .matching( returnTokensUpdatedWithFuzzy( searchParam, 2 ) ) // fuzzy version of each token, e.g. "peace~2"
                        //.matching( searchParam )
                        .defaultOperator( BooleanOperator.AND ).boost( 1.0f ).constantScore()
                )
                .should( f.simpleQueryString()
                        .field( "storyNgram" )
                        //.matching( returnTokensUpdatedWithFuzzy( searchParam, 2 ) )
                        .matching( searchParam ) // ngram/prefix matching
                        .defaultOperator( BooleanOperator.AND ).boost( 0.5f )
                )
        )
        .fetch( offset, recordsPerPage );
And here is the index mapping for the story field:
@Lob @Column(columnDefinition = "TEXT")
@FullTextField(analyzer = "english", projectable = Projectable.NO, searchAnalyzer = "english")
@FullTextField(name = "storyNgram", analyzer = "nGramAnalyzer", projectable = Projectable.NO, searchAnalyzer = "nGramAnalyzer")
private String story;
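For completeness, the nGramAnalyzer referenced above is configured roughly along these lines (a simplified sketch, trimmed down to the relevant bits; the only detail that matters for the questions below is the minGramSize of 4, the maxGramSize here is just for illustration):

context.analyzer( "nGramAnalyzer" ).custom()
        .tokenizer( StandardTokenizerFactory.class )
        .tokenFilter( LowerCaseFilterFactory.class )
        .tokenFilter( EdgeNGramFilterFactory.class ) // edge-ngram prefixes of each token
                .param( "minGramSize", "4" )
                .param( "maxGramSize", "10" );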
So to begin with, I followed your example and updated the Story field (above).
I tested this all out and made some observations:
- To get results as close as possible to the search term(s), I want the words to be contiguous where possible, i.e. a plain OR approach does not make sense: it gives me far more search results, most of them matching just a single term, which is not great. So instead of this:
    .should( f.match().field( "story" )
            .matching( searchParam )
            .boost( 2.0f ) // requested words, not contiguous => even smaller boost
    )
I went with the simpleQueryString approach, where I can force all terms to at least exist in the same document. This is based on your suggestion of scoring this type of predicate lower:
    .should( f.simpleQueryString()
            .field( "story" )
            .matching( searchParam )
            .defaultOperator( BooleanOperator.AND ).boost( 2.0f )
    )
- Next I wanted to incorporate a fuzzy match in case of typos. This is different from the ngram prefix, because ngrams rely on a specific word stem being spelled correctly to produce a matching gram, whereas with a typo that prefix might never be typed. So I did this:
    .should( f.simpleQueryString()
            .field( "story" )
            .matching( returnTokensUpdatedWithFuzzy( searchParam, 2 ) )
            //.matching( searchParam )
            .defaultOperator( BooleanOperator.AND ).boost( 1.0f ).constantScore()
    )
I am manually breaking the search phrase up into tokens and appending a tilde with the edit distance, like so:
peacee~2 grows~2 stronger~2
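For reference, returnTokensUpdatedWithFuzzy boils down to something like this (a simplified sketch of the helper; it just splits on whitespace and appends the tilde):

// Split the search phrase on whitespace and append "~<editDistance>" to every token.
// (Uses java.util.Arrays and java.util.stream.Collectors.)
private String returnTokensUpdatedWithFuzzy(String searchParam, int editDistance) {
    return Arrays.stream( searchParam.trim().split( "\\s+" ) )
            .map( token -> token + "~" + editDistance )
            .collect( Collectors.joining( " " ) );
}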
My first inquiry:
- If I had a phrase like “Peace in the UK”, it wouldn’t yield great results: UK~2 matches practically any other short word, and hence we get irrelevant results. I’ve heard there is an exact-prefix-length concept for fuzzy queries. Could I programmatically enforce a minimum number of characters per token before executing this predicate, e.g. require that each token in the phrase is at least 5 characters long? (See the sketch after this inquiry for what I have in mind.)
- The other thing is: how does this interact with stop words, if at all? Should a STOP filter (StopFilterFactory.class) be used? I am currently using one, like so:
context.analyzer( "searchAnalyzer" ).custom()
        .tokenizer( StandardTokenizerFactory.class )
        .tokenFilter( LowerCaseFilterFactory.class )
        .tokenFilter( SnowballPorterFilterFactory.class ).param( "language", "English" )
        .tokenFilter( ASCIIFoldingFilterFactory.class )
        .tokenFilter( StopFilterFactory.class )
        .charFilter( HTMLStripCharFilterFactory.class );
If I create a string like “peace~2 in~2 the~2 uk~2”, I am including the stop words “in” and “the”. Will that mess up the analysis? I get a lot of irrelevant records back.
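To make both parts of this inquiry concrete, this is roughly the pre-filtering I’m considering before building the fuzzy string; the stop-word list and the minimum length of 5 are just placeholder values for illustration:

// Hypothetical pre-filtering: drop stop words and skip tokens that are too short to fuzz safely.
// (Uses java.util.Arrays, List, Optional, Set and java.util.stream.Collectors.)
private static final Set<String> STOP_WORDS = Set.of( "a", "an", "and", "in", "of", "on", "the", "to" );
private static final int MIN_FUZZY_TOKEN_LENGTH = 5;

private Optional<String> buildFuzzyQueryString(String searchParam, int editDistance) {
    List<String> fuzzyTokens = Arrays.stream( searchParam.toLowerCase().trim().split( "\\s+" ) )
            .filter( token -> !STOP_WORDS.contains( token ) )             // ignore stop words entirely
            .filter( token -> token.length() >= MIN_FUZZY_TOKEN_LENGTH )  // only fuzz long-enough tokens
            .map( token -> token + "~" + editDistance )
            .collect( Collectors.toList() );
    // If nothing survives the filtering, the fuzzy should clause would be skipped entirely.
    return fuzzyTokens.isEmpty() ? Optional.empty() : Optional.of( String.join( " ", fuzzyTokens ) );
}

The idea is that if nothing survives the filtering, I would skip the fuzzy should clause altogether.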
My second inquiry is for the next nGram predicate:
    .should( f.simpleQueryString()
            .field( "storyNgram" )
            //.matching( returnTokensUpdatedWithFuzzy( searchParam, 2 ) )
            .matching( searchParam )
            .defaultOperator( BooleanOperator.AND ).boost( 0.5f )
    )
Similar to the previous fuzzy approach, I want to make this a conditional predicate. I found in my testing that even though I have asked for this predicate to be executed as an AND, any search-phrase terms shorter than the minimum ngram length are simply omitted, so the search can end up behaving like an OR. E.g. “Peace in the UK” ends up looking like storyNgram:peac
E.g. for minGramSize = 4, the phrase “peace in the uk” gives me 192 hits, which are largely irrelevant. Almost 30% of my database is returned.
So can I skip this predicate altogether if the search terms are not all at least minGramSize characters long? If so, how can I enforce this?
Otherwise it looks like there will be an extremely high number of irrelevant results. I’m starting to feel that these last two predicates are erratic: they work at times, but more often than not they just add noise.
    .should( f.simpleQueryString()
            .field( "story" )
            .matching( returnTokensUpdatedWithFuzzy( searchParam, 2 ) )
            //.matching( searchParam )
            .defaultOperator( BooleanOperator.AND ).boost( 1.0f ).constantScore()
    )
    .should( f.simpleQueryString()
            .field( "storyNgram" )
            //.matching( returnTokensUpdatedWithFuzzy( searchParam, 2 ) )
            .matching( searchParam )
            .defaultOperator( BooleanOperator.AND ).boost( 0.5f )
    )
Having a conditional clause around these would be great. Is something like this possible? If so, how would I check for the right conditions (a minimum word length for the fuzzy search, a minimum ngram length for the ngram search) and otherwise skip the predicate?
E.g. is something like this doable in should clauses?
if ( patternParts.length > 0 ) {
    and.add( f.wildcard()
            .field( "departmentCode" )
            .matching( patternParts[0] ) );
}
if ( patternParts.length > 1 ) {
    and.add( f.wildcard()
            .field( "collectionCode" )
            .matching( patternParts[1] ) );
}
if ( patternParts.length > 2 ) {
    and.add( f.wildcard()
            .field( "itemCode" )
            .matching( patternParts[2] ) );
}
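And if that pattern does work with should clauses, would something along these lines be reasonable for my case? (allTokensAtLeast is a hypothetical helper of mine, only there to show the kind of check I mean.)

// Same search(...).select(...) as above, but with the last two should clauses added conditionally.
.where( f -> f.bool( b -> {
    b.should( f.phrase().field( "story" )
            .matching( searchParam )
            .boost( 10.0f ) );
    // ... the other unconditional should clauses from the query above ...

    // Fuzzy clause only when every token is long enough for an edit distance of 2 to stay relevant.
    if ( allTokensAtLeast( searchParam, 5 ) ) {
        b.should( f.simpleQueryString()
                .field( "story" )
                .matching( returnTokensUpdatedWithFuzzy( searchParam, 2 ) )
                .defaultOperator( BooleanOperator.AND ).boost( 1.0f ).constantScore() );
    }
    // Ngram clause only when every token is at least minGramSize (4) characters long.
    if ( allTokensAtLeast( searchParam, 4 ) ) {
        b.should( f.simpleQueryString()
                .field( "storyNgram" )
                .matching( searchParam )
                .defaultOperator( BooleanOperator.AND ).boost( 0.5f ) );
    }
} ) )
.fetch( offset, recordsPerPage );

// Hypothetical helper: true when every whitespace-separated token has at least minLength characters.
private boolean allTokensAtLeast(String searchParam, int minLength) {
    return Arrays.stream( searchParam.trim().split( "\\s+" ) )
            .allMatch( token -> token.length() >= minLength );
}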