What is the exact format expected for the files referenced by the 'words' parameter?
I need to build these files:
stopwords.properties
mapping-chars.properties
I didn't find a prototype or example in the documentation: [#example-analyzer-def](Hibernate Search 7.0.0.Final: Reference Documentation)
When in doubt, look at the relevant javadoc.
In this case, org.apache.lucene.analysis.core.StopFilterFactory:
* The valid values for the <code>format</code> option are:
* </p>
* <ul>
* <li><code>wordset</code> - This is the default format, which supports one word per
* line (including any intra-word whitespace) and allows whole line comments
* begining with the "#" character. Blank lines are ignored. See
* {@link WordlistLoader#getLines WordlistLoader.getLines} for details.
* </li>
* <li><code>snowball</code> - This format allows for multiple words specified on each
* line, and trailing comments may be specified using the vertical line ("|").
* Blank lines are ignored. See
* {@link WordlistLoader#getSnowballWordSet WordlistLoader.getSnowballWordSet}
* for details.
* </li>
And the method:
/**
* Accesses a resource by name and returns the (non comment) lines containing
* data using the given character encoding.
*
* <p>
* A comment line is any line that starts with the character "#"
* </p>
*
* @return a list of non-blank non-comment lines with whitespace trimmed
* @throws IOException If there is a low-level I/O error.
*/
public static List<String> getLines(InputStream stream, Charset charset) throws IOException {
  BufferedReader input = null;
  ArrayList<String> lines;
  boolean success = false;
  try {
    input = getBufferedReader(IOUtils.getDecodingReader(stream, charset));
    lines = new ArrayList<>();
    for (String word = null; (word = input.readLine()) != null;) {
      // skip initial bom marker
      if (lines.isEmpty() && word.length() > 0 && word.charAt(0) == '\uFEFF')
        word = word.substring(1);
      // skip comments
      if (word.startsWith("#")) continue;
      word = word.trim();
      // skip blank lines
      if (word.length() == 0) continue;
      lines.add(word);
    }
    success = true;
    return lines;
  } finally {
    if (success) {
      IOUtils.close(input);
    } else {
      IOUtils.closeWhileHandlingException(input);
    }
  }
}
So it’s just a text file with one word per line.
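For illustration (the words are made up, not taken from any shipped file), a stop-word file in the default wordset format holds one word per line, with whole-line # comments:

```text
# French stop words (wordset format: one word per line)
aux
avec
ce
```

while the snowball format allows several words per line, with trailing comments after a vertical bar:

```text
aux avec ce | common French stop words
```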
Thank you for this quick reply.
It really helps me to ask a simple question in order to simplify my understanding of a complex process.
So I put stopwords.txt in the described format:
aux
avec
ce
ces
dans
de
des
du
elle
en
I still get this exception:
Caused by: org.hibernate.search.exception.SearchException: Could not initialize Analyzer definition @org.hibernate.search.annotations.AnalyzerDef(charFilters=[@org.hibernate.search.annotations.CharFilterDef(name=, params=[@org.hibernate.search.annotations.Parameter(name=mapping, value=search/analyzer/mapping-chars.properties)], factory=class org.apache.lucene.analysis.charfilter.MappingCharFilterFactory), @org.hibernate.search.annotations.CharFilterDef(name=, params=[], factory=class org.apache.lucene.analysis.charfilter.HTMLStripCharFilterFactory)], filters=[@org.hibernate.search.annotations.TokenFilterDef(name=, params=[], factory=class org.apache.lucene.analysis.standard.StandardFilterFactory), @org.hibernate.search.annotations.TokenFilterDef(name=, params=[], factory=class org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilterFactory), @org.hibernate.search.annotations.TokenFilterDef(name=, params=[@org.hibernate.search.annotations.Parameter(name=language, value=French)], factory=class org.apache.lucene.analysis.snowball.SnowballPorterFilterFactory), @org.hibernate.search.annotations.TokenFilterDef(name=, params=[], factory=class org.apache.lucene.analysis.core.LowerCaseFilterFactory), @org.hibernate.search.annotations.TokenFilterDef(name=, params=[@org.hibernate.search.annotations.Parameter(name=words, value=search/analyzer/stopwords_fr.txt), @org.hibernate.search.annotations.Parameter(name=ignoreCase, value=true)], factory=class org.apache.lucene.analysis.core.StopFilterFactory)], name=customAnalyzer, tokenizer=@org.hibernate.search.annotations.TokenizerDef(name=, params=[], factory=class org.apache.lucene.analysis.standard.StandardTokenizerFactory))
at org.hibernate.search.analyzer.impl.LuceneAnalyzerBuilder.buildAnalyzer(LuceneAnalyzerBuilder.java:79) ~[hibernate-search-engine-5.11.5.Final.jar:5.11.5.Final]
at org.hibernate.search.analyzer.impl.NamedLuceneAnalyzerReference.createAnalyzer(NamedLuceneAnalyzerReference.java:65) ~[hibernate-search-engine-5.11.5.Final.jar:5.11.5.Final]
at org.hibernate.search.analyzer.impl.NamedLuceneAnalyzerReference.initialize(NamedLuceneAnalyzerReference.java:61) ~[hibernate-search-engine-5.11.5.Final.jar:5.11.5.Final]
at org.hibernate.search.analyzer.impl.LuceneEmbeddedAnalyzerStrategy.initializeReference(LuceneEmbeddedAnalyzerStrategy.java:208) ~[hibernate-search-engine-5.11.5.Final.jar:5.11.5.Final]
at org.hibernate.search.analyzer.impl.LuceneEmbeddedAnalyzerStrategy.lambda$initializeReferences$0(LuceneEmbeddedAnalyzerStrategy.java:166) ~[hibernate-search-engine-5.11.5.Final.jar:5.11.5.Final]
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374) ~[na:1.8.0_144]
at java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742) ~[na:1.8.0_144]
at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580) ~[na:1.8.0_144]
at org.hibernate.search.analyzer.impl.LuceneEmbeddedAnalyzerStrategy.initializeReferences(LuceneEmbeddedAnalyzerStrategy.java:166) ~[hibernate-search-engine-5.11.5.Final.jar:5.11.5.Final]
at org.hibernate.search.engine.impl.SearchIntegrationConfigContext.initialize(SearchIntegrationConfigContext.java:73) ~[hibernate-search-engine-5.11.5.Final.jar:5.11.5.Final]
at org.hibernate.search.engine.impl.ConfigContext.initIntegrations(ConfigContext.java:251) ~[hibernate-search-engine-5.11.5.Final.jar:5.11.5.Final]
at org.hibernate.search.spi.SearchIntegratorBuilder.initDocumentBuilders(SearchIntegratorBuilder.java:470) ~[hibernate-search-engine-5.11.5.Final.jar:5.11.5.Final]
at org.hibernate.search.spi.SearchIntegratorBuilder.createNewFactoryState(SearchIntegratorBuilder.java:244) ~[hibernate-search-engine-5.11.5.Final.jar:5.11.5.Final]
at org.hibernate.search.spi.SearchIntegratorBuilder.buildNewSearchFactory(SearchIntegratorBuilder.java:200) ~[hibernate-search-engine-5.11.5.Final.jar:5.11.5.Final]
at org.hibernate.search.spi.SearchIntegratorBuilder.buildSearchIntegrator(SearchIntegratorBuilder.java:128) ~[hibernate-search-engine-5.11.5.Final.jar:5.11.5.Final]
at org.hibernate.search.hcore.impl.HibernateSearchSessionFactoryObserver.boot(HibernateSearchSessionFactoryObserver.java:127) ~[hibernate-search-orm-5.11.5.Final.jar:5.11.5.Final]
at org.hibernate.search.hcore.impl.HibernateSearchSessionFactoryObserver.sessionFactoryCreated(HibernateSearchSessionFactoryObserver.java:94) ~[hibernate-search-orm-5.11.5.Final.jar:5.11.5.Final]
at org.hibernate.internal.SessionFactoryObserverChain.sessionFactoryCreated(SessionFactoryObserverChain.java:35) ~[hibernate-core-5.3.11.Final.jar:5.3.11.Final]
at org.hibernate.internal.SessionFactoryImpl.<init>(SessionFactoryImpl.java:372) ~[hibernate-core-5.3.11.Final.jar:5.3.11.Final]
at org.hibernate.boot.internal.SessionFactoryBuilderImpl.build(SessionFactoryBuilderImpl.java:467) ~[hibernate-core-5.3.11.Final.jar:5.3.11.Final]
at org.hibernate.jpa.boot.internal.EntityManagerFactoryBuilderImpl.build(EntityManagerFactoryBuilderImpl.java:939) ~[hibernate-core-5.3.11.Final.jar:5.3.11.Final]
at org.springframework.orm.jpa.vendor.SpringHibernateJpaPersistenceProvider.createContainerEntityManagerFactory(SpringHibernateJpaPersistenceProvider.java:57) ~[spring-orm-5.1.9.RELEASE.jar:5.1.9.RELEASE]
at org.springframework.orm.jpa.LocalContainerEntityManagerFactoryBean.createNativeEntityManagerFactory(LocalContainerEntityManagerFactoryBean.java:365) ~[spring-orm-5.1.9.RELEASE.jar:5.1.9.RELEASE]
at org.springframework.orm.jpa.AbstractEntityManagerFactoryBean.buildNativeEntityManagerFactory(AbstractEntityManagerFactoryBean.java:390) ~[spring-orm-5.1.9.RELEASE.jar:5.1.9.RELEASE]
... 106 common frames omitted
Caused by: java.nio.charset.MalformedInputException: Input length = 1
at java.nio.charset.CoderResult.throwException(CoderResult.java:281) ~[na:1.8.0_144]
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339) ~[na:1.8.0_144]
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) ~[na:1.8.0_144]
at java.io.InputStreamReader.read(InputStreamReader.java:184) ~[na:1.8.0_144]
at java.io.BufferedReader.fill(BufferedReader.java:161) ~[na:1.8.0_144]
at java.io.BufferedReader.readLine(BufferedReader.java:324) ~[na:1.8.0_144]
at java.io.BufferedReader.readLine(BufferedReader.java:389) ~[na:1.8.0_144]
at org.apache.lucene.analysis.util.WordlistLoader.getLines(WordlistLoader.java:217) ~[lucene-analyzers-common-5.5.5.jar:5.5.5 b3441673c21c83762035dc21d3827ad16aa17b68 - sarowe - 2017-10-20 08:57:36]
at org.apache.lucene.analysis.util.AbstractAnalysisFactory.getLines(AbstractAnalysisFactory.java:252) ~[lucene-analyzers-common-5.5.5.jar:5.5.5 b3441673c21c83762035dc21d3827ad16aa17b68 - sarowe - 2017-10-20 08:57:36]
at org.apache.lucene.analysis.util.AbstractAnalysisFactory.getWordSet(AbstractAnalysisFactory.java:241) ~[lucene-analyzers-common-5.5.5.jar:5.5.5 b3441673c21c83762035dc21d3827ad16aa17b68 - sarowe - 2017-10-20 08:57:36]
at org.apache.lucene.analysis.core.StopFilterFactory.inform(StopFilterFactory.java:109) ~[lucene-analyzers-common-5.5.5.jar:5.5.5 b3441673c21c83762035dc21d3827ad16aa17b68 - sarowe - 2017-10-20 08:57:36]
at org.hibernate.search.analyzer.impl.LuceneAnalyzerBuilder.injectResourceLoader(LuceneAnalyzerBuilder.java:159) ~[hibernate-search-engine-5.11.5.Final.jar:5.11.5.Final]
at org.hibernate.search.analyzer.impl.LuceneAnalyzerBuilder.buildAnalysisComponent(LuceneAnalyzerBuilder.java:153) ~[hibernate-search-engine-5.11.5.Final.jar:5.11.5.Final]
at org.hibernate.search.analyzer.impl.LuceneAnalyzerBuilder.buildAnalyzer(LuceneAnalyzerBuilder.java:127) ~[hibernate-search-engine-5.11.5.Final.jar:5.11.5.Final]
at org.hibernate.search.analyzer.impl.LuceneAnalyzerBuilder.buildAnalyzer(LuceneAnalyzerBuilder.java:108) ~[hibernate-search-engine-5.11.5.Final.jar:5.11.5.Final]
at org.hibernate.search.analyzer.impl.LuceneAnalyzerBuilder.buildAnalyzer(LuceneAnalyzerBuilder.java:76) ~[hibernate-search-engine-5.11.5.Final.jar:5.11.5.Final]
... 129 common frames omitted
this is my customAnalyzer Def:
@AnalyzerDef(name = "_customAnalyzer",
    charFilters = {
        // Replaces one or more characters with one or more characters,
        // based on mappings specified in the resource file
        @CharFilterDef(factory = MappingCharFilterFactory.class,
            params = { @Parameter(name = "mapping", value = "search/analyzer/mapping-chars.properties") }),
        // Remove standard HTML tags, keeping the text
        @CharFilterDef(factory = HTMLStripCharFilterFactory.class)
    },
    // Use the Lucene StandardTokenizer
    tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
    filters = {
        // Remove dots from acronyms and 's from words
        @TokenFilterDef(factory = StandardFilterFactory.class),
        // Remove accents for languages like French
        @TokenFilterDef(factory = ASCIIFoldingFilterFactory.class),
        // Reduces a word to its root in a given language
        // (e.g. protect, protects, protection share the same root).
        // Using such a filter allows searches to match related words.
        @TokenFilterDef(factory = SnowballPorterFilterFactory.class, params = {
            @Parameter(name = "language", value = "French")
        }),
        // Lower-cases all words
        @TokenFilterDef(factory = LowerCaseFilterFactory.class),
        // Remove words (tokens) matching a list of stop words
        @TokenFilterDef(factory = StopFilterFactory.class, params = {
            @Parameter(name = "words", value = "search/analyzer/stopwords_fr.txt"),
            @Parameter(name = "ignoreCase", value = "true")
        })
    })
Caused by: java.nio.charset.MalformedInputException: Input length = 1
@yrodiere
Well, I think it's caused by the accented European characters (an unsupported encoding):
it works fine when I eliminate those words!
e.g.:
êtes
étais
était
étions
Yes, this means your file uses the wrong charset:
Caused by: java.nio.charset.MalformedInputException: Input length = 1
From what I can see in org.apache.lucene.analysis.util.AbstractAnalysisFactory#getLines, Lucene expects UTF-8.
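A quick way to check this outside of Hibernate Search is to read the file with an explicit UTF-8 decoder: the same MalformedInputException appears if the file was saved in another encoding such as ISO-8859-1 (the file name and the word below are made up for the demonstration):

```java
import java.io.IOException;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class CharsetCheck {
    public static void main(String[] args) throws IOException {
        // Simulate a stop-word file saved as ISO-8859-1 (Latin-1),
        // which is what many Windows editors use by default.
        Path file = Files.createTempFile("stopwords_fr", ".txt");
        Files.write(file, "étais\n".getBytes(StandardCharsets.ISO_8859_1));

        // Reading it back as UTF-8 (which is what Lucene does)
        // fails on the accented byte.
        try {
            Files.readAllLines(file, StandardCharsets.UTF_8);
            System.out.println("no error");
        } catch (MalformedInputException e) {
            System.out.println(e.getClass().getSimpleName());
        }

        // Re-encode the file as UTF-8 and the same read succeeds.
        List<String> words = Files.readAllLines(file, StandardCharsets.ISO_8859_1);
        Files.write(file, String.join("\n", words).getBytes(StandardCharsets.UTF_8));
        System.out.println(Files.readAllLines(file, StandardCharsets.UTF_8).get(0));
    }
}
```

So the fix is simply to re-save stopwords_fr.txt (and mapping-chars.properties) as UTF-8 in your editor.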