Hibernate Search on special characters

kamokaizen · February 20, 2020, 2:55pm

Hi guys, I’m using Hibernate full-text search in my application. My problem is that searches fail when the search string contains special characters. For example, for search strings such as “Honda CR-V” or “Peugeot 206+”, Hibernate Search returns an empty result.

I use the configuration below on the index. I also tried switching the tokenizer to WhitespaceTokenizerFactory, but that did not work either.

@AnalyzerDef(name = "searchTextAnalyzer",
        tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
        filters = {
                @TokenFilterDef(factory = LowerCaseFilterFactory.class),
                @TokenFilterDef(factory = StandardFilterFactory.class)
        })

My Hibernate version is 5.4.5.Final and my Hibernate Search version is 5.11.3.Final.
Could you help me solve this problem? Thanks for any help.

yrodiere · February 20, 2020, 3:04pm

Hello,

I’ll need more information. In particular:

1. The code that creates and executes the search query.
2. The code of the relevant entities, including in particular @Indexed/@Field annotations.
3. Do you use the experimental Elasticsearch integration, or just Lucene (the default)?
4. An example of a search string that doesn’t work with the text that it should match. E.g. you’re looking for “Peugeot 206+”, but what is the name of the car in your database? Is it “Peugeot 206+” too, or something slightly different such as “peugeot 206+”?

kamokaizen · February 21, 2020, 6:37am

Hi yrodiere,

1- The generic query method is shown below; it takes the entity fields to search and the projection fields.

Fields used for the full-text search: “vehicleId”, “category.categoryId”, “category.name”, “modelDescription”, “longModelDescription”, “brand”, “equipmentType”, “modelType”, “body”, “startYear”, “endYear”

Projection fields: “vehicleId”, “category.categoryId”, “brand”, “modelDescription”, “longModelDescription”, “startYear”, “endYear”, “body”, “hp”

protected Analyzer analyzer = new WhitespaceAnalyzer();

    @Override
    public List<T> searchProjection(String searchText, String[] fields, String[] projectionFields, BasicTransformerAdapter resultTransformer, Sort sort, int firstResult, int maxResult, Class<T> entityClass) {
        if (Strings.isNullOrEmpty(searchText)) {
            return new ArrayList<>();
        }

        List<String> keywords = this.tokenizeString(this.analyzer, searchText);

        try {
            FullTextSession fullTextSession = this.getFullTextSession();
            QueryBuilder qb = this.getQueryBuilder(fullTextSession, entityClass);

            BooleanJunction<BooleanJunction> booleanJunction = qb.bool();
            booleanJunction.must(qb.keyword().onField("deleted").matching(false).createQuery());

            for (String keyword : keywords) {
                booleanJunction.must(qb.keyword().wildcard().onFields(fields).matching(keyword + "*").createQuery());
            }
            FullTextQuery fullTextQuery = fullTextSession.createFullTextQuery(booleanJunction.createQuery(), entityClass);
            if (projectionFields != null && projectionFields.length > 0) {
                fullTextQuery.setProjection(projectionFields);
                fullTextQuery.setResultTransformer(resultTransformer);
            }
            fullTextQuery.setFirstResult(firstResult); //start from the firstResult element
            fullTextQuery.setMaxResults(maxResult); //return max elements
            if (sort != null) {
                fullTextQuery.setSort(sort);
            }
            return fullTextQuery.getResultList();
        } catch (Exception ex) {
            // Also covers EmptyQueryException; return an empty list instead of null so callers need no null check
            logger.error("Something went wrong while searching: {}, exception:{}", searchText, ex.getLocalizedMessage());
            return new ArrayList<>();
        }
    }

    /**
     * Run the input through the given analyzer and return the resulting terms, lowercased.
     *
     * @param analyzer the analyzer used to tokenize the input
     * @param string the raw search text
     * @return the list of terms produced by the analyzer
     */
    public List<String> tokenizeString(Analyzer analyzer, String string) {
        List<String> result = new ArrayList<>();
        // try-with-resources closes the stream even if analysis fails
        try (TokenStream stream = analyzer.tokenStream(null, new StringReader(string))) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                result.add(term.toString().toLowerCase());
            }
            stream.end();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return result;
    }

2- The entity model is as follows:

@Data
@Entity
@Indexed
@Table(name = "vehicle")
@NoArgsConstructor
@EqualsAndHashCode(of = {"vehicleId"}, callSuper=false)
@JsonIgnoreProperties({"hibernateLazyInitializer", "handler"})
public class VehicleDbo extends MappedDomainObjectBase {
    @Id
    @GeneratedValue(generator = "uuid2")
    @GenericGenerator(name = "uuid2", strategy = "uuid2")
    @Column(name = "vehicle_id", nullable = false, unique = true)
    private String vehicleId;

    @ManyToOne(fetch = FetchType.LAZY)
    @JoinColumn(name = "category_id", referencedColumnName = "category_id")
    @IndexedEmbedded(includeEmbeddedObjectId = true)
    private CategoryDbo category;

    @Column(name = "model")
    private String model;

    @Column(name = "brand")
    @Field(index = org.hibernate.search.annotations.Index.YES, analyze = Analyze.YES, store = Store.YES, analyzer = @Analyzer(definition = "searchTextAnalyzer"))
    private String brand;

    @Column(name = "model_description")
    @Field(index = org.hibernate.search.annotations.Index.YES, analyze = Analyze.YES, store = Store.YES, analyzer = @Analyzer(definition = "searchTextAnalyzer"))
    private String modelDescription;

    @Column(name = "width")
    private int width;

    @Column(name = "height")
    private int height;

    @Column(name = "body")
    @Field(index = org.hibernate.search.annotations.Index.YES, analyze = Analyze.YES, store = Store.YES, analyzer = @Analyzer(definition = "searchTextAnalyzer"))
    private String body;

    @Column(name = "end_year")
    @Field(index = org.hibernate.search.annotations.Index.YES, analyze = Analyze.YES, store = Store.YES, analyzer = @Analyzer(definition = "searchTextAnalyzer"))
    @FieldBridge(impl = LongBridge.class)
    private int endYear;

    @Column(name = "fuel_type")
    private String fuelType;

    @Column(name = "fuel_type_enum")
    @Enumerated(EnumType.ORDINAL)
    private FuelTypeEnum fuelTypeEnum;

    @Column(name = "equipment_type")
    @Field(index = org.hibernate.search.annotations.Index.YES, analyze = Analyze.YES, store = Store.YES, analyzer = @Analyzer(definition = "searchTextAnalyzer"))
    private String equipmentType;

    @Column(name = "long_model_description")
    @Field(index = org.hibernate.search.annotations.Index.YES, analyze = Analyze.YES, store = Store.YES, analyzer = @Analyzer(definition = "searchTextAnalyzer"))
    private String longModelDescription;

    @Column(name = "doors")
    private int doors;

    @Column(name = "acceleration")
    private double acceleration;

    @Column(name = "hp")
    @Field(index = org.hibernate.search.annotations.Index.YES, analyze = Analyze.YES, store = Store.YES, analyzer = @Analyzer(definition = "searchTextAnalyzer"))
    @FieldBridge(impl = LongBridge.class)
    private int hp;

    @Column(name = "udc")
    private double udc;

    @Column(name = "eudc")
    private double eudc;

    @Column(name = "vehicle_type")
    private int vehicleType;

    @Column(name = "auto_class")
    private String autoClass;

    @Column(name = "start_year")
    @Field(index = org.hibernate.search.annotations.Index.YES, analyze = Analyze.YES, store = Store.YES, analyzer = @Analyzer(definition = "searchTextAnalyzer"))
    @FieldBridge(impl = LongBridge.class)
    private int startYear;

    @Column(name = "seats")
    private int seats;

    @Column(name = "cylinders")
    private int cylinders;

    @Column(name = "ccm")
    private int ccm;
}

3- I use the Elasticsearch integration.

spring.jpa.properties.hibernate.search.default.indexmanager=elasticsearch
spring.jpa.properties.hibernate.search.default.elasticsearch.aws.signing.enabled=true
spring.jpa.properties.hibernate.search.default.elasticsearch.host=${ELASTIC_SEARCH_URL}
spring.jpa.properties.hibernate.search.default.elasticsearch.aws.access_key=${ELASTIC_SEARCH_AWS_ACCESS_KEY}
spring.jpa.properties.hibernate.search.default.elasticsearch.aws.secret_key=${ELASTIC_SEARCH_AWS_SECRET_KEY}
spring.jpa.properties.hibernate.search.default.elasticsearch.aws.region=${ELASTIC_SEARCH_AWS_REGION}
spring.jpa.properties.hibernate.search.default.elasticsearch.index_schema_management_strategy=create
spring.jpa.properties.hibernate.search.default.elasticsearch.required_index_status=yellow
spring.jpa.properties.hibernate.search.default.elasticsearch.read_timeout=600000
spring.jpa.properties.hibernate.search.default.elasticsearch.index_management_wait_timeout=600000

4- For example, searching for only “206+” returns nothing, while searching for “206” returns every match because of the wildcard.

These are the database records for 206+:

SELECT long_model_description FROM car.vehicle where long_model_description like '%206+%';

206+ 1.4 Comfort
206+ 1.4 Urban Move
206+ 1.4 Sportium
206+ 1.4 HDI Sportium
206+ 1.4 Sportium
206+ 1.4 HDI Urban Move
206+ 1.4 HDI Comfort
206+ 1.4 HDI Envy
206+ 1.4 HDI Sportium
206+ 1.4 Envy

yrodiere · February 21, 2020, 8:27am

Ok, the problem is here:

protected Analyzer analyzer = new WhitespaceAnalyzer();

And here:

        List<String> keywords = this.tokenizeString(this.analyzer, searchText);

And here:

            for (String keyword : keywords) {
                booleanJunction.must(qb.keyword().wildcard().onFields(fields).matching(keyword + "*").createQuery());
            }

You’re tokenizing the input string with a whitespace analyzer. This only splits the string on whitespace, so 206+ is left unchanged. Then you’re creating one wildcard query per token, in this case 206+*. Since parameters to wildcard queries (in your case, 206+) are not analyzed, Elasticsearch will look for all documents that contain a token starting with 206+.

The problem is that, with the analyzer you picked for the field at indexing time, I’m almost certain 206+ is transformed to just 206, because (IIRC) StandardFilterFactory removes special characters such as +. So the index does not contain 206+, only 206.
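
To make the mismatch concrete, here is a minimal sketch that uses only the JDK and does not call Lucene or Elasticsearch. The method indexAnalyze is a hypothetical stand-in: it assumes index-time analysis lowercases the text and drops anything that is not a letter or digit (real analyzers are more subtle, e.g. about "1.4"). The method whitespaceTokens mimics what the posted code does to the search string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Simplified model of the query/index mismatch; this is not Lucene code.
class AnalyzerMismatchSketch {

    // Assumption: index-time analysis lowercases and splits on every
    // character that is not a letter or digit, so "206+" is indexed as "206".
    static List<String> indexAnalyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Query-side tokenization as in the posted code: split on whitespace only.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Models a trailing-wildcard query "prefix*": true if any indexed term starts with the prefix.
    static boolean prefixMatches(String prefix, List<String> indexedTerms) {
        for (String term : indexedTerms) {
            if (term.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    // Mirrors the must(...) loop: every query token has to match some indexed term.
    static boolean matches(String searchText, String indexedText) {
        List<String> indexed = indexAnalyze(indexedText);
        for (String token : whitespaceTokens(searchText)) {
            if (!prefixMatches(token, indexed)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String doc = "206+ 1.4 HDI Comfort";
        System.out.println("indexed terms: " + indexAnalyze(doc));
        System.out.println("search '206+': " + matches("206+", doc));
        System.out.println("search '206':  " + matches("206", doc));
    }
}
```

Under these assumptions the query token 206+ can never be a prefix of an indexed term, because no indexed term contains a + at all, while 206 matches.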

The solution to your problem would be to use the same analyzer for queries as for indexing. So replace this:

protected Analyzer analyzer = new WhitespaceAnalyzer();

With this:

// Same analyzer as "searchTextAnalyzer"
protected Analyzer analyzer = new CustomAnalyzer.Builder()
        .withTokenizer( StandardTokenizerFactory.class )
        .addTokenFilter( LowerCaseFilterFactory.class )
        .addTokenFilter( StandardFilterFactory.class )
        .build();

This will, however, ignore the + sign completely in searches, which is probably not what you want. You should also consider tuning your analyzer, maybe replace the standard filter with just AsciiFoldingFilter + LowerCaseFilter. Be sure to update both analyzers (the one configured in Hibernate Search through annotations, and the one you instantiate directly).
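
The trade-off can be illustrated with a small JDK-only sketch (again, not Lucene code). The hypothetical analyze method assumes an analysis chain that lowercases and drops everything except letters and digits; it is applied to both the indexed text and the search string, which is the point of using the same analyzer on both sides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Simplified model of "same analysis at index time and query time"; not Lucene code.
class SharedAnalysisSketch {

    // Assumption: the shared analyzer lowercases and splits on every
    // character that is not a letter or digit.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Every analyzed query token must be a prefix of some analyzed indexed term,
    // which models one "token*" wildcard query per token combined with must(...).
    static boolean matches(String searchText, String indexedText) {
        List<String> indexed = analyze(indexedText);
        for (String token : analyze(searchText)) {
            boolean found = false;
            for (String term : indexed) {
                if (term.startsWith(token)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(matches("Peugeot 206+", "Peugeot 206+ 1.4 Comfort"));
        System.out.println(matches("Honda CR-V", "Honda CR-V 2.0"));
        // The + is dropped on both sides, so a plain 206 also matches a 206+ search.
        System.out.println(matches("206+", "206 1.4 XR"));
    }
}
```

In this model the searches succeed, but a search for 206+ cannot be told apart from a search for 206, which is why tuning the analyzer may still be worthwhile.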

Another, probably better solution would be to get rid of all local analysis, and to delegate everything to Elasticsearch. For this, you will have to use a more advanced technique that does not use the wildcard query, but instead uses the EdgeNGramFilter in the analyzer and overrides the analyzer at query time. See here for more information.
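
As a rough illustration of the edge n-gram idea, here is a JDK-only sketch; it models the concept and is not the Elasticsearch or Lucene API. The names, the gram bounds (MIN_GRAM, MAX_GRAM) and the choice of whitespace tokenization (so that 206+ keeps its +) are all assumptions made for the example. Prefixes are generated at indexing time only; the query is tokenized without n-gram expansion and looked up as exact terms, so no wildcard is needed.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Simplified model of edge n-gram indexing with a different query-time analysis; not Lucene code.
class EdgeNGramSketch {

    // Hypothetical bounds for the generated prefixes.
    static final int MIN_GRAM = 1;
    static final int MAX_GRAM = 10;

    // Assumption: both sides lowercase and split on whitespace only, so "+" and "-" survive.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Index time: expand each token into its leading prefixes ("206+" -> "2", "20", "206", "206+").
    static Set<String> indexTerms(String text) {
        Set<String> terms = new LinkedHashSet<>();
        for (String token : tokenize(text)) {
            int max = Math.min(MAX_GRAM, token.length());
            for (int len = MIN_GRAM; len <= max; len++) {
                terms.add(token.substring(0, len));
            }
        }
        return terms;
    }

    // Query time: same tokenization, no n-grams, exact term lookups instead of wildcards.
    static boolean matches(String searchText, String indexedText) {
        Set<String> indexed = indexTerms(indexedText);
        for (String token : tokenize(searchText)) {
            if (!indexed.contains(token)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String doc = "206+ 1.4 HDI Comfort";
        System.out.println(indexTerms(doc));
        System.out.println(matches("206+", doc));
        System.out.println(matches("com", doc));
    }
}
```

One limitation of this sketch: a query token longer than MAX_GRAM never matches, so the bound has to cover the longest prefix users are expected to type.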

kamokaizen · February 22, 2020, 2:44pm

I replaced StandardFilterFactory with AsciiFoldingFilter and recreated both the index and the custom analyzer. It works now. Thank you @yrodiere, much appreciated.

Humoyun_Norboboev · January 27, 2021, 9:47am

Hi. I have a similar problem. I tried your solution, but it didn’t work, so it seems I’m doing something wrong. Could you help me find what exactly is wrong with my code? Thanks for any help.

@Indexed
@Getter
@Setter
@Entity
@DynamicInsert
@DynamicUpdate
@Table(name = "contracts")
@Audited(targetAuditMode = RelationTargetAuditMode.NOT_AUDITED, withModifiedFlag = true)
@org.hibernate.annotations.Cache(usage = CacheConcurrencyStrategy.NONSTRICT_READ_WRITE)
@AnalyzerDef(name = "searchTextAnalyzer",
        tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
        filters = {@TokenFilterDef(factory = LowerCaseFilterFactory.class), @TokenFilterDef(factory = ASCIIFoldingFilterFactory.class)})
public class _Contract extends _Entity {
    
    @Fields(value = {@Field(analyze = Analyze.YES, bridge = @FieldBridge(impl = StringBridge.class), index = Index.YES, analyzer = @Analyzer(definition = "searchTextAnalyzer")),
            @Field(analyze = Analyze.NO, name = "codeSort", index = org.hibernate.search.annotations.Index.NO)})
    @SortableField(forField = "codeSort")
    @Column(unique = true)
    private String code;
...

yrodiere · January 27, 2021, 10:00am

@Humoyun_Norboboev Sure we can help, but please create your own topic.

Include in particular:

1. The code that creates and executes the search query.
2. The code of the relevant entities, including in particular @Indexed / @Field annotations.
3. Do you use the experimental Elasticsearch integration, or just Lucene (the default)?
4. An example of a search string that doesn’t work with the text that it should match. E.g. you’re looking for “Peugeot 206+”, but what is the name of the car in your database? Is it “Peugeot 206+” too, or something slightly different such as “peugeot 206+”?
