Can someone please help me out? I am stuck at wildcard search with special characters using StandardTokenizerFactory

Can someone please help me out?
I am stuck at wildcard search with special characters using StandardTokenizerFactory.
Is there any way to index my data with special characters using StandardTokenizerFactory?

If your special characters are considered “punctuation” by the standard tokenizer (i.e. they separate words/sentences), then no, you cannot index these characters while using StandardTokenizerFactory.

Use another tokenizer, for example WhitespaceTokenizerFactory.

Note that if you need both, you can always map multiple fields to the same property, and assign a different analyzer to each field. Then at query time you will use whichever field makes more sense for your particular query.

Example with Hibernate Search 6:

public class MyEntity {

    // ...

    // Creates a field named "myText"
    @FullTextField(analyzer = "someAnalyzerWithStandardTokenizerFactory")
    // Creates a field named "myText_forWildcards"
    @FullTextField(name = "myText_forWildcards", analyzer = "someAnalyzerWithWhitespaceTokenizerFactory")
    private String myText;

    // ...
}

Example with Hibernate Search 5:

public class MyEntity {

    // ...

    // Creates a field named "myText"
    @Field(analyzer = @Analyzer(definition = "someAnalyzerWithStandardTokenizerFactory"))
    // Creates a field named "myText_forWildcards"
    @Field(name = "myText_forWildcards", analyzer = @Analyzer(definition = "someAnalyzerWithWhitespaceTokenizerFactory"))
    private String myText;

    // ...
}

@yrodiere I am unable to understand the solution.
My requirement is that I want to index special characters with StandardTokenizerFactory.
As you said, that’s not possible. But I am indexing my data from the DB into the indexes using StandardTokenizerFactory, so I need a solution that will keep the special characters in the indexes so that I can apply wildcards to special characters as well.

There was a typo in my example, I fixed it.

Regardless… if you’re new to Hibernate Search, or more generally to Lucene, I’d recommend you stay away from wildcard queries. They probably don’t behave the way you think they do.

The main problem is that wildcard queries ignore analysis completely. This is probably what brought you here: when a user searches for foo.*, the wildcard query will not apply analysis and thus will not drop the trailing dot (.). It will then look for documents containing foo. (with a trailing dot), and won’t find anything, since your analyzer dropped all the dots at indexing time.
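
To make this concrete, here is a minimal pure-Java sketch (no Lucene involved; the class and method names are illustrative) of what a punctuation-stripping tokenizer does to the index, and why the literal wildcard prefix foo. can never match:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class WildcardMismatchDemo {
    // Rough stand-in for a standard tokenizer: split on whitespace,
    // strip punctuation, lowercase. Real analysis is more subtle.
    static List<String> analyze(String text) {
        return Arrays.stream(text.split("\\s+"))
                .map(t -> t.replaceAll("[^\\p{Alnum}]", "").toLowerCase())
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> indexedTerms = analyze("See foo. Then bar.");
        System.out.println(indexedTerms); // [see, foo, then, bar]

        // A wildcard query for "foo.*" is NOT analyzed: it looks for terms
        // starting with the literal prefix "foo." (dot included).
        boolean anyMatch = indexedTerms.stream().anyMatch(t -> t.startsWith("foo."));
        System.out.println(anyMatch); // false: the analyzer already dropped the dot
    }
}
```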

The easiest (but still a bit complex) alternative to wildcard queries is to use an EdgeNGramFilterFactory. You will find details about this solution in this stackoverflow answer.
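
The idea behind EdgeNGramFilterFactory, roughly: at indexing time each token is expanded into its prefixes, so an ordinary analyzed query on the user’s input matches with no wildcard at all. A pure-Java sketch of the principle (the real filter is configured via its minGramSize/maxGramSize parameters; this class is illustrative, not Lucene API):

```java
import java.util.ArrayList;
import java.util.List;

public class EdgeNGramDemo {
    // Mimics an edge n-gram filter: emit prefixes of each token
    // between minGram and maxGram characters long.
    static List<String> edgeNGrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int len = minGram; len <= Math.min(maxGram, token.length()); len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Terms indexed for the token "cottage" with minGram=2, maxGram=5:
        System.out.println(edgeNGrams("cottage", 2, 5)); // [co, cot, cott, cotta]
        // A plain term query for "cot" now matches; no wildcard needed.
        System.out.println(edgeNGrams("cottage", 2, 5).contains("cot")); // true
    }
}
```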


@yrodiere can you please let me know, in this example of yours, where I should put the definitions of both analyzers in my code?

Wherever you usually put your analyzer definitions.

See the documentation: https://docs.jboss.org/hibernate/search/5.11/reference/en-US/html_single/#_analyzer

One more question: do we need to create an extra column "myText_forWildcards" in the database for the particular field you mentioned in the example?

@yrodiere Thanks a lot, man!
Your approach helped me out.

No, you don’t. You can perfectly well have one column in the database but two fields in the index.


@yrodiere Do we have any functionality in Hibernate Search/Lucene that makes wildcard search work at whitespace positions as well? As of now I have two analyzers in my code (1. Standard, 2. WhitespaceTokenizer). I have a record like this: MILLPOOL COTTAGE (FIRST FLOOR FLAT). For this record, wildcards work perfectly fine at every position in a word, and on special characters as well, but as soon as I put a wildcard in place of a whitespace, it fails to give me a result. I am searching it like this: MILLPOOL COTTAGE?(FIRST FLOOR FLAT). Please suggest a valid approach for this if you have one.

If you want your search to consider whitespace just like any other character, it basically means you don’t want tokenization. If you don’t want tokenization, use the KeywordTokenizerFactory: it’s a no-op tokenizer.
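
To illustrate why that helps: with a keyword (no-op) tokenizer, the whole field value becomes a single token, so a wildcard can cross whitespace. A minimal pure-Java sketch of the matching semantics, translating Lucene-style ? and * into the equivalent regex (class and method names here are illustrative, not Lucene API):

```java
import java.util.regex.Pattern;

public class KeywordWildcardDemo {
    // Translate a Lucene-style wildcard pattern ('?' = exactly one char,
    // '*' = any run of chars) into a regex matched against the whole token.
    static boolean wildcardMatches(String pattern, String token) {
        StringBuilder regex = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            if (c == '*') regex.append(".*");
            else if (c == '?') regex.append(".");
            else regex.append(Pattern.quote(String.valueOf(c)));
        }
        return Pattern.matches(regex.toString(), token);
    }

    public static void main(String[] args) {
        // With a keyword tokenizer the single indexed token is the full value
        // (lowercased here, assuming a lowercase filter in the chain):
        String token = "millpool cottage (first floor flat)";
        // '?' can now stand in for the space:
        System.out.println(wildcardMatches("millpool cottage?(first floor flat)", token)); // true
        // With a whitespace tokenizer this query fails: no single token spans the space.
    }
}
```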


@yrodiere but my product needs to support the whitespace and standard analyzers as well. In this case, how would I use the keyword tokenizer too?

@yrodiere Please let me know if you have any suitable solutions. Your help will be appreciated.

As explained above, if you have multiple different needs, use multiple different fields on the same property. Have one field for your wildcard search that treats whitespace like any other character, and another for features that work on words only.

If you need both in the same query, you can always create one query for each field, and combine them with a boolean junction to achieve an “OR” operation. Something like this:

Query queryOnFieldWithWhitespaceTokenizer = ...;
Query queryOnFieldWithKeywordTokenizer = ...;
// junctionQuery will match any document matched by the first OR the second query
Query junctionQuery = queryBuilder.bool()
        .should(queryOnFieldWithWhitespaceTokenizer)
        .should(queryOnFieldWithKeywordTokenizer)
        .createQuery();
FullTextQuery query = Search.getFullTextEntityManager(entityManager)
        .createQuery(junctionQuery, MyEntity.class);

See also the documentation.


@yrodiere will this approach work for wildcards as well?

It will match documents that match either query. Whether they are wildcard queries or not is not relevant.

I’m not sure it will fit your requirements, but only you can determine that :slight_smile:


@yrodiere I am indexing data which looks like this in the DB: “FLAT 2, 7, THE OLD NOTTINGHAM ARMS”, with the double quotes. I am using the whitespace and standard tokenizers for this property. The analyzers are like this:
@AnalyzerDef(
    name = "textanalyzer",
    tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class, params = {
            @Parameter(name = "maxTokenLength", value = "8000")
    }),
    filters = {
            @TokenFilterDef(factory = StopFilterFactory.class),
            @TokenFilterDef(factory = EnglishPossessiveFilterFactory.class),
            @TokenFilterDef(factory = LowerCaseFilterFactory.class),
            @TokenFilterDef(factory = DoubleMetaphoneFilterFactory.class, params = {
                    @Parameter(name = "maxCodeLength", value = "10"),
                    @Parameter(name = "inject", value = "true") }),
    }
)
@AnalyzerDef(
    name = "WithWhitespaceTokenizerFactory",
    tokenizer = @TokenizerDef(factory = WhitespaceTokenizerFactory.class),
    filters = {
            @TokenFilterDef(factory = LowerCaseFilterFactory.class),
            @TokenFilterDef(factory = DoubleMetaphoneFilterFactory.class, params = {
                    @Parameter(name = "maxCodeLength", value = "10"),
                    @Parameter(name = "inject", value = "true") }),
    }
)

@Field(termVector = TermVector.WITH_POSITION_OFFSETS)
@Analyzer(definition = "textanalyzer")
@Field(name = "fullRecord_forWildcards", analyzer = @Analyzer(definition = "WithWhitespaceTokenizerFactory"))
@Column(name = "FullRecord")
private String fullRecord;

And then I am hitting a query using the whitespace-analyzed field, fullRecord_forWildcards. In the search string I am passing "FLAT 2, 7, THE OLD NOTTINGHAM ARMS", with the double quotes, and expecting the same result, but it does not give any record. The query made is like this: Fulltext_Query_UsingSplit = FullTextQueryImpl(+fullRecord_forWildcards:"flat* +fullRecord_forWildcards:1,* +fullRecord_forWildcards:7,* +fullRecord_forWildcards:the* +fullRecord_forWildcards:old* +fullRecord_forWildcards:nottingham* +fullRecord_forWildcards:arms"* status:Current).

Please let me know why it’s not giving me the result with double quotes.

Double quotes are not interpreted for keyword and wildcard queries. These are relatively low-level queries, and they expect a field value as input, not a structured query.

If you want to provide a structured query and have it parsed, with features such as phrase queries (with double quotes), prefix queries (* after a word), fuzziness, etc., maybe you should have a look at the “simple query string” query. There is no support for wildcards inside words, though: they will only work at the end of a word.


@yrodiere As you have seen in my code above, I am using the whitespace tokenizer to tokenize the fullRecord column. But when I perform a wildcard operation on it, it gives me results, just not the expected ones. As we know, the * wildcard means any character sequence. So if I search for FLAT*, it should only give me the records starting with flat. But in my case it is giving the results which contain flat anywhere in the record, as well as those starting with flat.
I am attaching the logic I have implemented for this. Please have a look and let me know the reason or a solution. Your help will be appreciated.

This is the method below :

private void createSearchStringQuery(QueryBuilder qb, BooleanJunction bool, String columnName, String matchingString)
{
	// Field analyzed with the whitespace tokenizer
	String tokenizedField = "fullRecord";
	List<String> splitInput = new ArrayList<>();

	matchingString = matchingString.replaceAll("^[\"']+|[\"']+$", "");
	boolean doesWildcardExist = matchingString.contains("?") || matchingString.contains("*");
	boolean containsWhitespace = matchingString.contains(" ");

	if (containsWhitespace) {
		splitInput = Arrays.asList(matchingString.split("\\s"));
	}
	else {
		splitInput.add(matchingString);
	}
	logger.info("Contains whitespace in search string: " + containsWhitespace);
	logger.info("Contains wildcard in search string: " + doesWildcardExist);
	logger.info("Split inputs are: " + splitInput);
	for (String token : splitInput) {
		org.apache.lucene.search.Query luceneQuery = qb.keyword().wildcard().onField(tokenizedField)
				.matching(token.toLowerCase() + "*").createQuery();
		bool.must(luceneQuery);
	}
}

my new pojo is as below :
/**
 * Whitespace tokenizer for the FullRecord column, for wildcards: splits words on whitespace.
 */
@AnalyzerDef(
    name = "WithWhitespaceTokenizerFactory",
    tokenizer = @TokenizerDef(factory = WhitespaceTokenizerFactory.class),
    filters = {
            @TokenFilterDef(factory = LowerCaseFilterFactory.class),
            @TokenFilterDef(factory = DoubleMetaphoneFilterFactory.class, params = {
                    @Parameter(name = "maxCodeLength", value = "10"),
                    @Parameter(name = "inject", value = "true") }),
    }
)

@Field(termVector = TermVector.WITH_POSITION_OFFSETS)
@Analyzer(definition = "WithWhitespaceTokenizerFactory")
@Column(name = "FullRecord")
private String fullRecord;

The results come like this if I search for FLAT*:

  1. SHELL FOR 26 AND FLAT OVER, SHELL FOR 26 AND FLAT OVER
  2. FLAT 26
  3. FLAT 67