Search UTF-8 Hibernate search 6.1.7

Hi everybody,

I am trying to implement Hibernate Search.
Hibernate version: 5.6.11.Final
Elasticsearch server: 7.10.1
Hibernate Search: 6.1.7

I’m trying to implement a search similar to LIKE '%value%' in SQL
on the “title” field of the Email class.
In some cases the search does not find the expected entries.

My implementation code:

@Entity
@Table(name = "emails")
@Getter
@Setter
@NoArgsConstructor(force = true)
@AllArgsConstructor
@Builder
@Indexed(index = "idx_email")
public class Email implements Serializable {
    private static final long serialVersionUID = 5522175857108981474L;

    @Id
    @GeneratedValue(generator = "UUID")
    @GenericGenerator(name = "UUID", strategy = "org.hibernate.id.UUIDGenerator")
    private String id;

    @FullTextField(name = "title_search", norms = Norms.YES, analyzer = "english")
    private String title;
}

// Custom Analysis
public class MyElasticsearchAnalysisConfigurer implements ElasticsearchAnalysisConfigurer {
    @Override
    public void configure(ElasticsearchAnalysisConfigurationContext context) {
        context.analyzer( "english" ).custom()
                .tokenizer( "standard" )
                .tokenFilters( "lowercase", "snowball_english", "asciifolding" );

        context.tokenFilter( "snowball_english" )
                .type( "snowball" )
                .param( "language", "English" );

        context.analyzer( "name" ).custom()
                .tokenizer( "standard" )
                .tokenFilters( "lowercase", "asciifolding" );
    }
}

// Search

SearchScroll<Email> searchScroll = searchSession.search(Email.class).where(f -> f.bool(b -> {
  
    // region TitleSearch
    if (org.apache.commons.lang3.StringUtils.isNotEmpty(titleSearch)) {
        String wildcard = "" + titleSearch + "";
        List<String> tokens = this.subStringTokenQuery(wildcard);
        String fieldNameTitle = Email_.TITLE + "_search";
        System.out.println("Token: " + tokens);
        for (String token : tokens) {
            String wildcard1 = token.contains(".") ? token.replaceAll("\\.", "?") : token;
            b.must(f.bool()
                .should(f.match().field(fieldNameTitle).matching(token))
                .should(f.match().field(fieldNameTitle).matching(token.toLowerCase()))
                .should(f.wildcard().field(fieldNameTitle).matching("*" + wildcard1 + "*")));
        }
    }
})).sort(f -> f.composite(b -> {
    if (pageable.getSort().iterator().next().isAscending()) {
        b.add(f.field(pageable.getSort().iterator().next().getProperty()).asc());
    } else {
        b.add(f.field(pageable.getSort().iterator().next().getProperty()).desc());
    }
})).scroll(ScrollMode.FORWARD_ONLY.toResultSetType());

// Method split token
private List<String> subStringTokenQuery(String search) {
    if (org.apache.commons.lang3.StringUtils.isEmpty(search)) {
        return Collections.emptyList();
    }

    List<String> characterSpecific = Arrays.asList("-", "+", ":");
    String[] tokens = search.split("[\\s\\.\\+]+");

    List<String> results = new ArrayList<>();
    for (String token: tokens) {
        if (!characterSpecific.contains(token)) {
            results.add(token.trim());
        }
    }
    if (results.isEmpty()) {
        results.add(search);
    }
    return results;
}

My data is in Vietnamese, so I don’t know what language the “snowball” filter should be set to, or how I should modify my code to handle the cases above.

Thanks everyone.

Sorry, my paging code has an error.
Thank you.

Hey,

First, remove your string-splitting code. All of it. You’re not supposed to do that; Elasticsearch is. For your use case, all you should do is configure analyzers and pass the search string, then let Elasticsearch do its magic.
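To make "let the analyzer do the work" more concrete, here is a plain-Java approximation of what a `standard` tokenizer followed by `lowercase` and `asciifolding` does to Vietnamese text. It is only a sketch: `FoldingSketch`, `fold` and `analyze` are hypothetical names, the real filters run inside Elasticsearch and handle more cases, and stemming is left out entirely.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Hibernate Search or Elasticsearch.
final class FoldingSketch {

    // Rough stand-in for "lowercase" + "asciifolding": lowercases, removes
    // combining marks after NFD decomposition, and maps U+0111 (d with stroke)
    // to 'd' by hand, because NFD does not decompose that letter.
    static String fold(String text) {
        String lower = text.toLowerCase(Locale.ROOT);
        String decomposed = Normalizer.normalize(lower, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "").replace('\u0111', 'd');
    }

    // Rough stand-in for the "standard" tokenizer: after folding, split on
    // anything that is neither a letter nor a digit.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : fold(text).split("[^\\p{L}\\p{N}]+")) {
            if (!raw.isEmpty()) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "Tiếng Việt" written with Unicode escapes to avoid source-encoding issues.
        System.out.println(analyze("Ti\u1EBFng Vi\u1EC7t")); // prints [tieng, viet]
    }
}
```

Because the same folding is applied to the indexed title and to the search string, a query typed without diacritics can still line up with the indexed tokens, and no manual splitting is needed on the application side.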

Second, avoid the wildcard predicate: it’s very limited and probably won’t do what you want. If you need to match strings by their prefix, then you need to configure the edge-ngram filter. You’ll find an extensive explanation of the solution here.
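As a rough illustration of what an edge-ngram filter emits for a single token, here is a minimal plain-Java sketch. `EdgeNgramSketch` and `edgeNgrams` are made-up names; the real filter runs inside Elasticsearch after the other token filters in the analyzer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that mimics the output of an edge n-gram filter.
final class EdgeNgramSketch {

    // Returns the prefixes of `token` whose length lies between minGram and
    // maxGram (inclusive), shortest first.
    static List<String> edgeNgrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int limit = Math.min(maxGram, token.length());
        for (int len = minGram; len <= limit; len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        // prints [i, in, inv, invo, invoi, invoic, invoice]
        System.out.println(edgeNgrams("invoice", 1, 10));
    }
}
```

Every prefix becomes its own indexed token, so a query term such as "inv" can match "invoice" with a plain term lookup. Only prefixes of words are covered this way; text in the middle of a word is not.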

So, use this mapping:

@Entity
@Table(name = "emails")
@Getter
@Setter
@NoArgsConstructor(force = true)
@AllArgsConstructor
@Builder
@Indexed(index = "idx_email")
public class Email implements Serializable {
    private static final long serialVersionUID = 5522175857108981474L;

    @Id
    @GeneratedValue(generator = "UUID")
    @GenericGenerator(name = "UUID", strategy = "org.hibernate.id.UUIDGenerator")
    private String id;

    // HEADS UP, change below
    @FullTextField(name = "title_search", norms = Norms.YES,
            analyzer = "english_edgengram", searchAnalyzer = "english")
    private String title;
}

public class MyElasticsearchAnalysisConfigurer implements ElasticsearchAnalysisConfigurer {
    @Override
    public void configure(ElasticsearchAnalysisConfigurationContext context) {
        context.analyzer( "english" ).custom()
                .tokenizer( "standard" )
                .tokenFilters( "lowercase", "snowball_english", "asciifolding" );

        context.analyzer( "english_edgengram" ).custom()
                .tokenizer( "standard" )
                .tokenFilters( "lowercase", "snowball_english", "asciifolding", "edgengram" );

        context.tokenFilter( "snowball_english" )
                .type( "snowball" )
                .param( "language", "English" );

        context.tokenFilter( "edgengram" )
                .type( "edge_ngram" )
                .param( "max_gram", 10 )
                .param( "min_gram", 1 );

        context.analyzer( "name" ).custom()
                .tokenizer( "standard" )
                .tokenFilters( "lowercase", "asciifolding" );
    }
}
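The mapping above uses one analyzer for indexing (`english_edgengram`) and a different one for searching (`english`). The following plain-Java sketch tries to show why that asymmetry matters. All names are hypothetical, stemming and folding are left out, and a match is simplified to "the query shares at least one token with the index".

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical simulation of index-time versus search-time analysis.
final class SearchAnalyzerSketch {
    static final int MIN_GRAM = 1;
    static final int MAX_GRAM = 10;

    // Analysis WITH edge n-grams: lowercase, split on whitespace, emit every word prefix.
    static Set<String> withEdgeNgrams(String text) {
        Set<String> tokens = new LinkedHashSet<>();
        for (String word : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            int limit = Math.min(MAX_GRAM, word.length());
            for (int len = MIN_GRAM; len <= limit; len++) {
                tokens.add(word.substring(0, len));
            }
        }
        return tokens;
    }

    // Analysis WITHOUT edge n-grams: lowercase and split only.
    static Set<String> withoutEdgeNgrams(String text) {
        Set<String> tokens = new LinkedHashSet<>();
        for (String word : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!word.isEmpty()) {
                tokens.add(word);
            }
        }
        return tokens;
    }

    // Simplified matching: true if any query token is present in the index.
    static boolean sharesToken(Set<String> indexed, Set<String> queried) {
        for (String q : queried) {
            if (indexed.contains(q)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> indexed = withEdgeNgrams("Holiday schedule");
        // Plain search analysis: "hibernate" is not a prefix of any indexed word.
        System.out.println(sharesToken(indexed, withoutEdgeNgrams("hibernate"))); // prints false
        // Edge n-grams at search time too: the query token "h" matches "holiday".
        System.out.println(sharesToken(indexed, withEdgeNgrams("hibernate")));    // prints true
        // Plain search analysis still finds real prefixes.
        System.out.println(sharesToken(indexed, withoutEdgeNgrams("sched")));     // prints true
    }
}
```

In this simplified model, applying edge n-grams to the query as well would let unrelated titles match as soon as they share a first letter, which is presumably why the search side uses the plain analyzer.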

And use the following search code, relying on the simpleQueryString predicate to force matches to include all provided terms:

SearchScroll<Email> searchScroll = searchSession.search(Email.class).where(f -> f.bool(b -> {
    // region TitleSearch
    if (org.apache.commons.lang3.StringUtils.isNotEmpty(titleSearch)) {
        // Pass the raw search string; the search analyzer tokenizes it.
        b.must(f.simpleQueryString().field(Email_.TITLE + "_search").matching(titleSearch)
                .defaultOperator(BooleanOperator.AND));
    }
})).sort(f -> f.composite(b -> {
    if (pageable.getSort().iterator().next().isAscending()) {
        b.add(f.field(pageable.getSort().iterator().next().getProperty()).asc());
    } else {
        b.add(f.field(pageable.getSort().iterator().next().getProperty()).desc());
    }
// Note: scroll(int) expects a chunk size; this expression only happens to evaluate to an int.
})).scroll(ScrollMode.FORWARD_ONLY.toResultSetType());
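To show what the AND default operator changes, here is a small plain-Java sketch of the two matching modes. `DefaultOperatorSketch` and `matches` are hypothetical names, and the "index" is just a set of already-analyzed tokens; the real evaluation happens inside Elasticsearch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical model of OR versus AND semantics across query terms.
final class DefaultOperatorSketch {

    // With requireAll = true every query term must be present (AND);
    // with requireAll = false a single present term is enough (OR).
    static boolean matches(Set<String> indexedTokens, String query, boolean requireAll) {
        String[] terms = query.toLowerCase(Locale.ROOT).trim().split("\\s+");
        int hits = 0;
        for (String term : terms) {
            if (indexedTokens.contains(term)) {
                hits++;
            }
        }
        return requireAll ? hits == terms.length : hits > 0;
    }

    public static void main(String[] args) {
        Set<String> indexed = new HashSet<>(Arrays.asList("invoice", "for", "march"));
        System.out.println(matches(indexed, "invoice april", false)); // prints true
        System.out.println(matches(indexed, "invoice april", true));  // prints false
        System.out.println(matches(indexed, "march invoice", true));  // prints true
    }
}
```

With AND, adding words to the search string narrows the results instead of widening them, which is closer to what a user expects from a LIKE-style filter.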

When you’ve done that, run a few tests, and if the behavior doesn’t match what you need, please explain with a few examples.