Eliminating non-matching Records

Hi Members,
I am using the whitespace tokenizer to tokenize my records for indexing.
My records contain postal codes, e.g. PL10 1AA and PL11 1AA.
Suppose I have just these two records for now. When I run a NOT query on the keyword PL10, it also removes the record containing PL11, which is not correct for my requirements. The same happens with an exact-match search: searching for PL10 also returns records containing PL11.

	@AnalyzerDef(name = "Standardanalyzer",
			tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class,
					params = { @Parameter(name = "maxTokenLength", value = "8000") }),
			filters = {
					@TokenFilterDef(factory = LowerCaseFilterFactory.class),
					@TokenFilterDef(factory = PhoneticFilterFactory.class, params = {
							@Parameter(name = "encoder", value = "Metaphone"),
							@Parameter(name = "maxCodeLength", value = "10"),
							@Parameter(name = "inject", value = "true") }) })
	@AnalyzerDef(name = "WithWhitespaceTokenizerFactory",
			tokenizer = @TokenizerDef(factory = WhitespaceTokenizerFactory.class),
			filters = {
					@TokenFilterDef(factory = LowerCaseFilterFactory.class),
					@TokenFilterDef(factory = PhoneticFilterFactory.class, params = {
							@Parameter(name = "encoder", value = "Metaphone"),
							@Parameter(name = "maxCodeLength", value = "10"),
							@Parameter(name = "inject", value = "true") }) })
	@Field(termVector = TermVector.WITH_POSITION_OFFSETS)
	@Analyzer(definition = "WithWhitespaceTokenizerFactory")
	@Field(name = "fullRecordStandard", analyzer = @Analyzer(definition = "Standardanalyzer"))
	@Column(name = "FullRecord")
	private String fullRecord;

NOT query for PL10 1AA: the query performed is -(fullRecordStandard:"(PL pl10) 1aa")
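A likely cause of the collision, given the extra "PL" term visible in that query: the PhoneticFilterFactory's Metaphone encoder ignores non-letter characters, so both PL10 and PL11 produce the same phonetic token. The sketch below is an illustration only (a simplified stand-in for Metaphone's input-cleaning step, not the real encoder):

```java
// Illustration: why PL10 and PL11 collide under a phonetic filter.
// Metaphone-style encoders discard non-letter characters before encoding,
// so both postcodes reduce to the same key -- matching the injected "PL"
// term seen in the query string above.
public class PhoneticCollision {
    // Simplified stand-in for the encoder's letters-only cleaning step.
    static String phoneticKey(String token) {
        return token.replaceAll("[^A-Za-z]", "").toUpperCase();
    }

    public static void main(String[] args) {
        System.out.println(phoneticKey("PL10")); // PL
        System.out.println(phoneticKey("PL11")); // PL -- same token, so a
        // NOT query on PL10 also excludes records containing PL11
    }
}
```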

Please suggest how to achieve this, or which tokenizer is best suited for exact or partial matching without eliminating (or returning) such non-matching records.
Your help will be appreciated.

Neither the whitespace tokenizer nor the StandardTokenizer is going to help in your current situation, because both split words on whitespace.
You have two possible workarounds for your problem:
1: Write your own analyzer that splits the data on spaces, but searches on the basis of the first token: if it matches, return the result, otherwise don't.
This approach sounds good, but such an analyzer is very hard to write.
2: Store the data without any spaces and return results with a regex/wildcard query like PL10*.
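The second approach above can be sketched in plain Java (the helper names and in-memory search are illustrative stand-ins for the index-time normalisation and the wildcard query):

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch of approach 2: normalise postcodes at index time by stripping
// whitespace, then answer a "pl10*"-style search with a prefix test.
public class PostcodePrefixSearch {
    // Normalisation applied at index time: lowercase, remove all spaces.
    static String normalize(String postcode) {
        return postcode.toLowerCase().replaceAll("\\s+", "");
    }

    // Prefix match standing in for the wildcard query "pl10*".
    static List<String> search(List<String> records, String prefix) {
        String p = normalize(prefix);
        return records.stream()
                .filter(r -> normalize(r).startsWith(p))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> records = List.of("PL10 1AA", "PL11 1AA");
        System.out.println(search(records, "PL10")); // [PL10 1AA] only
    }
}
```

Because the stored value has no spaces, the prefix "pl10" matches "pl101aa" but not "pl111aa", which is exactly the exclusion behaviour asked for.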

Hi @Kumar_Shikhar

I think it’s time you take a step back and learn more about how analyzers and tokenizers work.

It’s the basis of full-text search with Lucene-based technologies, and if you don’t understand it, there’s no way you will be able to implement anything.

I agree with your advice; I don’t have much knowledge of Lucene, as I am a newbie to Hibernate Search and Lucene.
I am trying to learn and explore it as much as possible.
I just got caught by a requirement that my current knowledge of tokenizers and analyzers doesn’t cover.
Please let me know where I am going wrong in achieving the requirement.
Thanks!

Yes, I tried a lot with the whitespace/standard tokenizers but failed.
I have to achieve an exact match, meaning that if I search for PL10, I should get only the records that contain PL10.
The same applies when eliminating records. Users might also search for the whole postcode, and in that case I have to check for an exact match too.
I just wanted to know what needs to go into a custom analyzer.

The second approach does not seem to work. I ran a wildcard query after storing the records with the keyword tokenizer.
A regex query like PL10* is not bringing back any records.

Yes, because you have a space in between (if there is any space or special character in between, the regex won’t work).
Put this in your field;
obviously, change the name of the field.

I guess this configuration will work the same as the keyword tokenizer does.
Will changing the name of the field and setting analyze = Analyze.NO bring the results?

Change the name of the field according to your variable; the one above was mine.
And if you set Analyze.NO, it won’t tokenize the value while indexing.
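For reference, an un-analyzed field along these lines might look like the following (a sketch in Hibernate Search 5 annotation style; the field name fullRecordExact is just an example):

```java
// Extra un-analyzed field: the whole value is indexed as a single term,
// so exact and wildcard matches are not split on the space.
@Field(name = "fullRecordExact", index = Index.YES, analyze = Analyze.NO, store = Store.NO)
@Column(name = "FullRecord")
private String fullRecord;
```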
