Can someone please help me out? I am stuck at wildcard search with special characters using StandardTokenizerFactory

Can someone please help me out?
I am stuck at wildcard search with special characters using StandardTokenizerFactory.
Is there any way to index my data with special characters using StandardTokenizerFactory?

If your special characters are considered “punctuation” by the standard tokenizer (i.e. they separate words/sentences), then no, you cannot index these characters while using StandardTokenizerFactory.

Use another tokenizer, for example WhitespaceTokenizerFactory.

Note that if you need both, you can always map multiple fields to the same property, and assign a different analyzer to each field. Then at query time you will use whichever field makes more sense for your particular query.

Example with Hibernate Search 6:

public class MyEntity {

    // ...

    // Creates a field named "myText"
    @FullTextField(analyzer = "someAnalyzerWithStandardTokenizerFactory")
    // Creates a field named "myText_forWildcards"
    @FullTextField(name = "myText_forWildcards", analyzer = "someAnalyzerWithWhitespaceTokenizerFactory")
    private String myText;

    // ...
}

Example with Hibernate Search 5:

public class MyEntity {

    // ...

    // Creates a field named "myText"
    @Field(analyzer = @Analyzer(definition = "someAnalyzerWithStandardTokenizerFactory"))
    // Creates a field named "myText_forWildcards"
    @Field(name = "myText_forWildcards", analyzer = @Analyzer(definition = "someAnalyzerWithWhitespaceTokenizerFactory"))
    private String myText;

    // ...
}

@yrodiere I am unable to understand the solution.
My requirement is that I want to index special characters with StandardTokenizerFactory.
As you said, that’s not possible. But I am indexing my data from the DB into the indexes using StandardTokenizerFactory, so I need a solution that will keep the special characters in the indexes so that I can apply wildcards to special characters as well.

There was a typo in my example, I fixed it.

Regardless… if you’re new to Hibernate Search, or more generally to Lucene, I’d recommend you stay away from wildcard queries. They probably don’t behave the way you think they do.

The main problem is that wildcard queries ignore analysis completely. This is probably what brought you here: when a user searches for foo.*, the wildcard query will not apply analysis and thus will not drop the trailing dot (.). It will then look for documents containing foo. (with a trailing dot), and won’t find anything, since your analyzer dropped all the dots at indexing time.
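
To make this concrete, here is a minimal pure-Java sketch (no Lucene involved; the class and method names are illustrative) of what a punctuation-stripping tokenizer does to the index, and why the literal wildcard prefix foo. can never match:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class WildcardMismatchDemo {
    // Rough stand-in for a standard tokenizer: split on whitespace,
    // strip punctuation, lowercase. Real analysis is more subtle.
    static List<String> analyze(String text) {
        return Arrays.stream(text.split("\\s+"))
                .map(t -> t.replaceAll("[^\\p{Alnum}]", "").toLowerCase())
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> indexedTerms = analyze("See foo. Then bar.");
        System.out.println(indexedTerms); // [see, foo, then, bar]

        // A wildcard query for "foo.*" is NOT analyzed: it looks for terms
        // starting with the literal prefix "foo." (dot included).
        boolean anyMatch = indexedTerms.stream().anyMatch(t -> t.startsWith("foo."));
        System.out.println(anyMatch); // false: the analyzer already dropped the dot
    }
}
```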

The easiest (but still a bit complex) alternative to wildcard queries is to use an EdgeNGramFilterFactory. You will find details about this solution in this stackoverflow answer.
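
The idea behind EdgeNGramFilterFactory, roughly: at indexing time each token is expanded into its prefixes, so an ordinary analyzed query on the user’s input matches with no wildcard at all. A pure-Java sketch of the principle (the real filter is configured via its minGramSize/maxGramSize parameters; this class is illustrative, not Lucene API):

```java
import java.util.ArrayList;
import java.util.List;

public class EdgeNGramDemo {
    // Mimics an edge n-gram filter: emit prefixes of each token
    // between minGram and maxGram characters long.
    static List<String> edgeNGrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int len = minGram; len <= Math.min(maxGram, token.length()); len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Terms indexed for the token "cottage" with minGram=2, maxGram=5:
        System.out.println(edgeNGrams("cottage", 2, 5)); // [co, cot, cott, cotta]
        // A plain term query for "cot" now matches; no wildcard needed.
        System.out.println(edgeNGrams("cottage", 2, 5).contains("cot")); // true
    }
}
```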


@yrodiere can you please let me know, in this example of yours, where I should put the definitions of both analyzers in my code?

Wherever you usually put your analyzer definitions.

See the documentation: https://docs.jboss.org/hibernate/search/5.11/reference/en-US/html_single/#_analyzer

One more question: do we need to create an extra column "myText_forWildcards" in the database for the particular field you mentioned in the example?

@yrodiere Thanks a lot, man!
Your approach helped me out.

No, you don’t. You can perfectly well have one column in the database but two fields in the index.


@yrodiere Do we have any functionality in Hibernate Search/Lucene that makes wildcard search work at whitespace positions as well? As of now I have two analyzers in my code (1. Standard, 2. WhitespaceTokenizer). I have a record like this: MILLPOOL COTTAGE (FIRST FLOOR FLAT). For this record, wildcards work perfectly fine at every position in a word, and on special characters as well, but as soon as I put a wildcard in place of a whitespace, it fails to give me a result. I am searching it like this: MILLPOOL COTTAGE?(FIRST FLOOR FLAT). Please suggest a valid approach for this if you have one.

If you want your search to consider whitespace just like any other character, it basically means you don’t want tokenization. If you don’t want tokenization, use the KeywordTokenizerFactory: it’s a no-op tokenizer.
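
To illustrate why that helps: with a keyword (no-op) tokenizer, the whole field value becomes a single token, so a wildcard can cross whitespace. A minimal pure-Java sketch of the matching semantics, translating Lucene-style ? and * into the equivalent regex (class and method names here are illustrative, not Lucene API):

```java
import java.util.regex.Pattern;

public class KeywordWildcardDemo {
    // Translate a Lucene-style wildcard pattern ('?' = exactly one char,
    // '*' = any run of chars) into a regex matched against the whole token.
    static boolean wildcardMatches(String pattern, String token) {
        StringBuilder regex = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            if (c == '*') regex.append(".*");
            else if (c == '?') regex.append(".");
            else regex.append(Pattern.quote(String.valueOf(c)));
        }
        return Pattern.matches(regex.toString(), token);
    }

    public static void main(String[] args) {
        // With a keyword tokenizer the single indexed token is the full value
        // (lowercased here, assuming a lowercase filter in the chain):
        String token = "millpool cottage (first floor flat)";
        // '?' can now stand in for the space:
        System.out.println(wildcardMatches("millpool cottage?(first floor flat)", token)); // true
        // With a whitespace tokenizer this query fails: no single token spans the space.
    }
}
```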


@yrodiere but my product needs to support the whitespace and standard analyzers as well. In this case, how would I use the keyword tokenizer too?

@yrodiere Please let me know if you have any suitable solutions. Your help will be appreciated.

As explained above, if you have multiple different needs, use multiple different fields on the same property. Have one field for your wildcard search that treats whitespace like any other character, and another for features that work on words only.

If you need both in the same query, you can always create one query for each field, and combine them with a boolean junction to achieve an “OR” operation. Something like this:

Query queryOnFieldWithWhitespaceTokenizer = ...;
Query queryOnFieldWithKeywordTokenizer = ...;
// junctionQuery will match any document matched by the first OR the second query
Query junctionQuery = queryBuilder.bool()
        .should(queryOnFieldWithWhitespaceTokenizer)
        .should(queryOnFieldWithKeywordTokenizer)
        .createQuery();
FullTextQuery query = Search.getFullTextEntityManager(entityManager)
        .createQuery(junctionQuery, MyEntity.class);

See also the documentation.


@yrodiere will this approach work for wildcards as well?

It will match documents that match either query. Whether they are wildcard queries or not is not relevant.

I’m not sure it will fit your requirements, but only you can determine that :slight_smile:


@yrodiere I am indexing data which looks like this in the DB: “FLAT 2, 7, THE OLD NOTTINGHAM ARMS”, with the double quotes. I am using the whitespace and standard tokenizers for this property. The analyzers are like this:
@AnalyzerDef(
    name = "textanalyzer",
    tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class, params = {
            @Parameter(name = "maxTokenLength", value = "8000")
    }),
    filters = {
            @TokenFilterDef(factory = StopFilterFactory.class),
            @TokenFilterDef(factory = EnglishPossessiveFilterFactory.class),
            @TokenFilterDef(factory = LowerCaseFilterFactory.class),
            @TokenFilterDef(factory = DoubleMetaphoneFilterFactory.class, params = {
                    @Parameter(name = "maxCodeLength", value = "10"),
                    @Parameter(name = "inject", value = "true") }),
    }
)
@AnalyzerDef(
    name = "WithWhitespaceTokenizerFactory",
    tokenizer = @TokenizerDef(factory = WhitespaceTokenizerFactory.class),
    filters = {
            @TokenFilterDef(factory = LowerCaseFilterFactory.class),
            @TokenFilterDef(factory = DoubleMetaphoneFilterFactory.class, params = {
                    @Parameter(name = "maxCodeLength", value = "10"),
                    @Parameter(name = "inject", value = "true") }),
    }
)

@Field(termVector = TermVector.WITH_POSITION_OFFSETS)
@Analyzer(definition = "textanalyzer")
@Field(name = "fullRecord_forWildcards", analyzer = @Analyzer(definition = "WithWhitespaceTokenizerFactory"))
@Column(name = "FullRecord")
private String fullRecord;

And then I am hitting a query using the whitespace-analyzed field, fullRecord_forWildcards. In the search string I am passing "FLAT 2, 7, THE OLD NOTTINGHAM ARMS", with the double quotes, and expecting the same result, but it does not give any record. The query made is like this: Fulltext_Query_UsingSplit = FullTextQueryImpl(+fullRecord_forWildcards:"flat* +fullRecord_forWildcards:1,* +fullRecord_forWildcards:7,* +fullRecord_forWildcards:the* +fullRecord_forWildcards:old* +fullRecord_forWildcards:nottingham* +fullRecord_forWildcards:arms"* status:Current).

Please let me know why it’s not giving me the result with double quotes.

Double quotes are not interpreted for keyword and wildcard queries. These are relatively low-level queries, and they expect a field value as input, not a structured query.

If you want to provide a structured query and have it parsed, with features such as phrase queries (with double quotes), prefix queries (* after a word), fuzziness, etc., maybe you should have a look at the “simple query string” query. There is no support for wildcards inside words, though: they will only work at the end of a word.


@yrodiere As you have seen in my code above, I am using the whitespace tokenizer to tokenize the fullRecord column. But when I perform a wildcard operation on it, it gives me results, just not the expected ones. As we know, the * wildcard means any character sequence. So if I search for FLAT*, it should only give me the records starting with flat. But in my case it is giving the results which contain flat anywhere in the record, as well as those starting with flat.
I am attaching the logic I have implemented for this. Please have a look and let me know the reason or a solution. Your help will be appreciated.

This is the method below :

private void createSearchStringQuery(QueryBuilder qb, BooleanJunction bool, String columnName, String matchingString)
{
	// Field analyzed with the whitespace tokenizer
	String tokenizedField = "fullRecord";
	List<String> splitInput = new ArrayList<>();

	matchingString = matchingString.replaceAll("^[\"']+|[\"']+$", "");
	boolean doesWildcardExist = matchingString.contains("?") || matchingString.contains("*");
	boolean containsWhitespace = matchingString.contains(" ");

	if (containsWhitespace) {
		splitInput = Arrays.asList(matchingString.split("\\s"));
	}
	else {
		splitInput.add(matchingString);
	}
	logger.info("Contains whitespace in search string: " + containsWhitespace);
	logger.info("Contains wildcard in search string: " + doesWildcardExist);
	logger.info("Split inputs are: " + splitInput);
	for (String token : splitInput) {
		org.apache.lucene.search.Query luceneQuery = qb.keyword().wildcard().onField(tokenizedField)
				.matching(token.toLowerCase() + "*").createQuery();
		bool.must(luceneQuery);
	}
}

my new pojo is as below :
/**
 * Whitespace tokenizer for the FullRecord column, for wildcards: splits words on whitespace.
 */
@AnalyzerDef(
    name = "WithWhitespaceTokenizerFactory",
    tokenizer = @TokenizerDef(factory = WhitespaceTokenizerFactory.class),
    filters = {
            @TokenFilterDef(factory = LowerCaseFilterFactory.class),
            @TokenFilterDef(factory = DoubleMetaphoneFilterFactory.class, params = {
                    @Parameter(name = "maxCodeLength", value = "10"),
                    @Parameter(name = "inject", value = "true") }),
    }
)

@Field(termVector = TermVector.WITH_POSITION_OFFSETS)
@Analyzer(definition = "WithWhitespaceTokenizerFactory")
@Column(name = "FullRecord")
private String fullRecord;

The results come like this if I search for FLAT*:

  1. SHELL FOR 26 AND FLAT OVER, SHELL FOR 26 AND FLAT OVER
  2. FLAT 26
  3. FLAT 67