Question
I looked into full-text search with NGramFilterFactory.java, hoping to find a way to configure the token_chars properties (letter, digit, whitespace, symbol). However, NGramFilterFactory.java only offers two parameters: minGramSize and maxGramSize. The latest version of NGramFilterFactory.java in lucene-analyzers-common 8.2.0 does not support this configuration either. I need to set token_chars because our project's data is complex. Please support this feature.
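For context: as far as I can tell, `token_chars` is a setting of Elasticsearch's `ngram` *tokenizer*, not of Lucene's `NGramFilterFactory` token *filter*, which is why the filter factory never exposes it. Since this project already uses hibernate-search-elasticsearch, one possible workaround is to define the tokenizer in the Elasticsearch index settings directly (names below are placeholders, not from the project):

```json
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 10,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "my_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "my_ngram_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```

Note that newer Elasticsearch versions restrict the allowed difference between `max_gram` and `min_gram` (via `index.max_ngram_diff`), so a spread like 1 to 300 may be rejected or warned about.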
Environment
- @AnalyzerDef
@Indexed(index = "####")
@AnalyzerDef(name = "ngram",
tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
filters = {
@TokenFilterDef(factory = StandardFilterFactory.class),
@TokenFilterDef(factory = LowerCaseFilterFactory.class),
@TokenFilterDef(factory = NGramFilterFactory.class,
params = {
@Parameter(name = "minGramSize", value = "1"),
@Parameter(name = "maxGramSize", value = "300")
})
})
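If the analyzer definition needs to stay in Java code, Hibernate Search 5's Elasticsearch integration can register native Elasticsearch analysis definitions, where `token_chars` is available. A sketch, assuming the 5.11 `ElasticsearchAnalysisDefinitionProvider` API (class and definition names are mine; please verify the exact builder methods against the 5.11 javadoc):

```java
import org.hibernate.search.elasticsearch.analyzer.definition.ElasticsearchAnalysisDefinitionProvider;
import org.hibernate.search.elasticsearch.analyzer.definition.ElasticsearchAnalysisDefinitionRegistryBuilder;

// Registered via the hibernate.search.elasticsearch.analysis_definition_provider property.
public class NGramAnalysisProvider implements ElasticsearchAnalysisDefinitionProvider {
    @Override
    public void register(ElasticsearchAnalysisDefinitionRegistryBuilder builder) {
        // Native Elasticsearch ngram tokenizer, which does support token_chars.
        builder.tokenizer("my_ngram_tokenizer")
                .type("ngram")
                .param("min_gram", "1")
                .param("max_gram", "10")
                .param("token_chars", "letter", "digit");
        builder.analyzer("ngram")
                .withTokenizer("my_ngram_tokenizer")
                .withTokenFilters("lowercase");
    }
}
```

The indexed field would then reference the definition by name, e.g. `@Analyzer(definition = "ngram")`, instead of the `@AnalyzerDef` above.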
- mvn dependency:tree
[INFO] +- org.hibernate:hibernate-search-orm:jar:5.11.3.Final:compile
[INFO] | \- org.hibernate:hibernate-search-engine:jar:5.11.3.Final:compile
[INFO] | +- org.apache.lucene:lucene-core:jar:5.5.5:compile
[INFO] | +- org.apache.lucene:lucene-misc:jar:5.5.5:compile
**[INFO] | +- org.apache.lucene:lucene-analyzers-common:jar:5.5.5:compile**
[INFO] | +- org.apache.lucene:lucene-facet:jar:5.5.5:compile
[INFO] | | \- org.apache.lucene:lucene-queries:jar:5.5.5:compile
[INFO] | \- org.apache.lucene:lucene-queryparser:jar:5.5.5:compile
[INFO] +- org.hibernate:hibernate-search-elasticsearch:jar:5.11.3.Final:compile
[INFO] | +- org.elasticsearch.client:elasticsearch-rest-client:jar:6.4.3:compile
[INFO] | | +- org.apache.httpcomponents:httpasyncclient:jar:4.1.4:compile
[INFO] | | \- org.apache.httpcomponents:httpcore-nio:jar:4.4.11:compile
[INFO] | \- org.elasticsearch.client:elasticsearch-rest-client-sniffer:jar:5.6.8:compile
- pom.xml
<!-- hibernate -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-data-jpa</artifactId>
<!--<version>2.1.8.RELEASE</version>-->
</dependency>
<dependency>
<groupId>org.hibernate</groupId>
<artifactId>hibernate-core</artifactId>
<version>5.4.1.Final</version>
</dependency>
<dependency>
<groupId>org.hibernate</groupId>
<artifactId>hibernate-search-orm</artifactId>
<version>5.11.3.Final</version>
</dependency>
<dependency>
<groupId>org.hibernate</groupId>
<artifactId>hibernate-search-elasticsearch</artifactId>
<version>5.11.3.Final</version>
</dependency>
- Elasticsearch version: 6.3.2
NGram Tokenizer - the latest version also does not support this configuration.
Note
I've implemented indexing for a large data set of about 4,000,000 records. Can you recommend a better way to write this code?
@Transactional
public void buildLargeSearchIndex() {
    int offset = 0;
    int batchSize = 1000;
    boolean indexComplete = false;
    // Obtain the FullTextEntityManager once instead of once per batch.
    FullTextEntityManager fullTextEntityManager =
            org.hibernate.search.jpa.Search.getFullTextEntityManager(entityManager);
    while (!indexComplete) {
        TypedQuery<####> query = fullTextEntityManager
                .createQuery("SELECT u FROM #### u", ####.class);
        query.setFirstResult(offset);
        query.setMaxResults(batchSize);
        log.info("Indexing {}, offset {}", batchSize, offset);
        List<####> results = query.getResultList();
        if (results.isEmpty()) {
            indexComplete = true;
        } else {
            offset += results.size();
            for (#### user : results) {
                fullTextEntityManager.index(user);
            }
            // Flush the batch to the index and clear the persistence context;
            // otherwise millions of managed entities accumulate in memory.
            fullTextEntityManager.flushToIndexes();
            fullTextEntityManager.clear();
        }
    }
    log.info("Indexed {} objects", offset);
}
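For a data set of this size, Hibernate Search 5 ships a MassIndexer that handles id scrolling, batching, multithreading, and session cleanup, and is usually preferable to a hand-rolled pagination loop. A sketch (`MyEntity` stands in for the `####` placeholder above; tune the numbers for your hardware):

```java
FullTextEntityManager fullTextEntityManager =
        org.hibernate.search.jpa.Search.getFullTextEntityManager(entityManager);
fullTextEntityManager
        .createIndexer(MyEntity.class)   // MyEntity is a placeholder for your entity
        .batchSizeToLoadObjects(50)      // entities loaded per batch
        .threadsToLoadObjects(4)         // parallel loading threads
        .idFetchSize(150)                // JDBC fetch size for the id scroll
        .cacheMode(CacheMode.IGNORE)     // skip the second-level cache while reindexing
        .startAndWait();                 // blocks until indexing finishes
```

Note that the MassIndexer manages its own sessions and transactions, so it should be started outside the `@Transactional` method above.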