Hibernate Search 6.1.4.Final + Elasticsearch 7.16.3, simpleQueryString flags

Hello,

1 ) I have a question regarding simpleQueryString flags. In the current newest version (6.1.4.Final) there is still no support for the flags SLOP and NONE (see SimpleQueryFlag.java). Do you have any plans to add support for these flags?

Elasticsearch documentation:

2 ) If no flags are specified, then all flags are active (wildcard, fuzzy and so on):

    SimpleQueryStringPredicateOptionsStep<?> predicate = query
            .matching(searchQuery)
            .analyzer(Analyzers.STANDARD_ANALYZER)
            .defaultOperator(BooleanOperator.AND);

In my case I would like to disable all flags, but I cannot choose the NONE flag (it is not supported in SimpleQueryFlag). As a workaround I have to enable only SimpleQueryFlag.WHITESPACE, the flag with the least impact, in order to disable all the others:

    SimpleQueryStringPredicateOptionsStep<?> predicate = query
            .matching(searchQuery)
            .analyzer(Analyzers.STANDARD_ANALYZER)
            .flags(SimpleQueryFlag.WHITESPACE)
            .defaultOperator(BooleanOperator.AND);

Maybe there could be a function that accepts strings instead of the enum and then handles them properly?

Example:

    SimpleQueryStringPredicateOptionsStep<?> predicate = query
            .matching(searchQuery)
            .analyzer(Analyzers.STANDARD_ANALYZER)
            .flags("NONE")
            .defaultOperator(BooleanOperator.AND);

3 ) I found something strange. Even if I activate just one flag, let’s say .flags(SimpleQueryFlag.WHITESPACE), the wildcard still works. Only the wildcard does; the others are disabled properly. As a result I could get all the data from my index. As a workaround I had to parse the input and remove all wildcards.
    Why is the wildcard active when I choose only one flag, e.g. SimpleQueryFlag.WHITESPACE? I think in that case just one operator (whitespace) should work and all the others should be disabled.

Hello,

As explained in the documentation you linked, SLOP is synonymous with NEAR, which we already support. We don’t intend to add new flags that are redundant.

NONE can be achieved by passing an empty list of flags: .flags(Collections.emptySet()).

Use .flags(Collections.emptySet()).

This is not normal. Are you sure it’s not caused by you using specific analyzers that reproduce the behavior of a prefix query (e.g. with the edge_ngram filter)? Can you provide a reproducer based on this template?

Sorry: I just checked the code, and it turns out an empty set of flags is not interpreted as NONE as I expected, and as we do for other flags elsewhere… I opened [HSEARCH-4536] - Hibernate JIRA to fix this.

EDIT: This was fixed in Hibernate Search 6.1.5.Final.

Cool, thank you.

OK, I will do it, but later; I don’t know when, as I unfortunately don’t have much time right now.
For now I will copy part of my test, and I have a quick question:

    String searchQuery = "*";
    List<Product> products = getSession()
            .search(Product.class)
            .where(f -> f.simpleQueryString().fields("description_normalized")
                    .matching(searchQuery)
                    .analyzer(ElasticsearchCustoms.Analyzers.STANDARD_ANALYZER)
                    .flags(SimpleQueryFlag.WHITESPACE)
                    .defaultOperator(BooleanOperator.AND)
            ).fetchAllHits();

I have a char filter to remove special characters:

    context.charFilter(CharFilters.PATTERN_REPLACE_SPECIAL_CHARS)
            .type("pattern_replace")
            .param("pattern", "[^a-zA-Z0-9]+")
            .param("replacement", "");

For the field “description_normalized”, during indexing I use:

    context.normalizer(Normalizers.STANDARD_NORMALIZER).custom()
            .tokenFilters("lowercase")
            .charFilters(CharFilters.PATTERN_REPLACE_SPECIAL_CHARS);

And during search I use:

    context.analyzer(Analyzers.STANDARD_ANALYZER).custom()
            .tokenizer("standard")
            .charFilters(CharFilters.PATTERN_REPLACE_SPECIAL_CHARS)
            .tokenFilters("lowercase");

In my opinion the wildcard should be removed by CharFilters.PATTERN_REPLACE_SPECIAL_CHARS during search, so it shouldn’t have any impact. But this search returns all results (even documents whose “description_normalized” is null). What is your opinion?
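To make that expectation concrete, here is a minimal sketch in plain Java of what the char filter pattern does to the input text. It assumes the pattern_replace char filter applies the pattern the way java.util.regex does; the class and method names are hypothetical, and the sketch models analysis only, not how the query string itself is parsed.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper that emulates the analysis chain from this post.
final class CharFilterDemo {

    // Same pattern as the PATTERN_REPLACE_SPECIAL_CHARS char filter above.
    private static final Pattern SPECIAL_CHARS = Pattern.compile("[^a-zA-Z0-9]+");

    // Emulates pattern_replace with an empty replacement, then the lowercase filter.
    static String normalize(String input) {
        return SPECIAL_CHARS.matcher(input).replaceAll("").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println("[" + normalize("*") + "]");        // prints "[]"
        System.out.println("[" + normalize("Foo-Bar*") + "]"); // prints "[foobar]"
    }
}
```

So, as far as analysis alone is concerned, a lone * is reduced to an empty string.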

Yes the wildcard should get removed during analysis. And I would expect a query without any token to not return any hit. I’m checking.

I’m getting exactly this behavior (no hits) in our integration tests, where we pass a string that gets analyzed down to an empty string.

I’ll need a reproducer if you want me to investigate this further. Thanks.

Ok I prepared a reproducer:

Thanks for the reproducer. I spent quite some time on this, and must admit I was a bit dumbfounded… until I stumbled upon this code in Lucene:

    if ("*".equals(queryText.trim())) {
      return new MatchAllDocsQuery();
    }

With any other character, you would get the behavior you want, because the text would indeed get analyzed, would be reduced to an empty list of tokens, and eventually the parsed query would be a MatchNoDocsQuery.

But you hit the one character hardcoded in the parser that triggers this behavior of matching every document.

This is Lucene, not Elasticsearch, but I’m fairly confident Elasticsearch relies on this exact parser and thus is affected the same way.

I’m afraid there isn’t much Hibernate Search can do to remove this behavior, and I suspect reporting it to Lucene/Elasticsearch won’t help, as this seems fairly intentional, although undocumented.

Maybe I should document this somewhere, though.

I guess in your application, you could pre-process the query string to replace * with an empty string. But that’s about all I have to suggest.
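A minimal sketch of such pre-processing, in plain Java. The class and method names are hypothetical and not part of Hibernate Search; the sketch assumes that removing every * is acceptable for the application, and that the caller skips the search (returning no hits) when nothing is left after stripping.

```java
// Hypothetical helper; not part of Hibernate Search, Elasticsearch or Lucene.
final class QueryStringSanitizer {

    // Removes every '*' from the user input and trims surrounding whitespace.
    static String stripWildcards(String userInput) {
        if (userInput == null) {
            return "";
        }
        return userInput.replace("*", "").trim();
    }

    // True if something is left to search for once wildcards are removed.
    static boolean isSearchable(String userInput) {
        return !stripWildcards(userInput).isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(isSearchable("*"));         // prints "false"
        System.out.println(stripWildcards(" shoes*")); // prints "shoes"
    }
}
```

The application would then pass stripWildcards(input) to .matching(...) only when isSearchable(input) is true, so a lone * never reaches the query parser.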

I would say it’s a bit dangerous: someone implementing search with simpleQueryString can accidentally allow users to access all the data (without even thinking about this situation). Yikes.

I’m currently removing all * from the query, but I didn’t like that solution and thought there was a better way to solve this problem or I missed something.

But now everything is clear. Thanks so much for the explanations, the suggestions and your time.

Glad it’s clearer now. Just one minor comment:

While I agree to some extent, any filters you want to apply should be added separately through a bool predicate, and would thus be unaffected by what the user types: even * would be filtered.

Also, if you want strict security, it would be a good idea to also have a safety net after the query execution, to check that each returned search hit is indeed accessible by the user who requested the search. That’s important in particular because the index may lag behind the database, and permissions might change over time. I know Spring Security allows post-execution security checks through AOP, i.e. annotations on your service methods + dedicated components that check permissions on a given object; and I’m sure other frameworks offer similar solutions.
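As a rough illustration of that safety net, here is a framework-free sketch in plain Java. It is not the Spring Security API: the class name, the method name and the permission predicate are all hypothetical, and a real application would plug in its own permission check against current database state.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical post-execution check; names are illustrative only.
final class HitAccessFilter {

    // Keeps only the hits that the supplied permission check accepts.
    static <T> List<T> keepAccessible(List<T> hits, Predicate<? super T> canAccess) {
        return hits.stream().filter(canAccess).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> hits = java.util.Arrays.asList("public-1", "private-2", "public-3");
        // Stand-in permission rule: the current user may only see "public-" items.
        List<String> visible = keepAccessible(hits, h -> h.startsWith("public-"));
        System.out.println(visible); // prints "[public-1, public-3]"
    }
}
```

Because the check runs on the hits after the query has executed, it also covers the case where the index is stale relative to the database.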