Eliminating non-matching Records

Hi Members,
I am using the whitespace tokenizer to tokenize my records for indexing.
My records contain postal codes, e.g. PL10 1AA and PL11 1AA.
Suppose I have just these two records for now. When I run a NOT query on the keyword PL10, it also removes the record containing PL11, which is not correct for my requirements. The same happens with an exact-match search: searching for PL10 also returns records containing PL11.

	@AnalyzerDef(name = "Standardanalyzer",
			tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class,
					params = { @Parameter(name = "maxTokenLength", value = "8000") }),
			filters = {
					@TokenFilterDef(factory = LowerCaseFilterFactory.class),
					@TokenFilterDef(factory = PhoneticFilterFactory.class, params = {
							@Parameter(name = "encoder", value = "Metaphone"),
							@Parameter(name = "maxCodeLength", value = "10"),
							@Parameter(name = "inject", value = "true") }) })
	@AnalyzerDef(name = "WithWhitespaceTokenizerFactory",
			tokenizer = @TokenizerDef(factory = WhitespaceTokenizerFactory.class),
			filters = {
					@TokenFilterDef(factory = LowerCaseFilterFactory.class),
					@TokenFilterDef(factory = PhoneticFilterFactory.class, params = {
							@Parameter(name = "encoder", value = "Metaphone"),
							@Parameter(name = "maxCodeLength", value = "10"),
							@Parameter(name = "inject", value = "true") }) })
	@Field(termVector = TermVector.WITH_POSITION_OFFSETS)
	@Analyzer(definition = "WithWhitespaceTokenizerFactory")
	@Field(name = "fullRecordStandard", analyzer = @Analyzer(definition = "Standardanalyzer"))
	@Column(name = "FullRecord")
	private String fullRecord;

NOT query for PL10 1AA: the query performed is -(fullRecordStandard:"(PL pl10) 1aa")
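A likely cause of the collision, given the extra "PL" term visible in that query: the PhoneticFilterFactory's Metaphone encoder ignores non-letter characters, so both PL10 and PL11 produce the same phonetic token. The sketch below is an illustration only (a simplified stand-in for Metaphone's input-cleaning step, not the real encoder):

```java
// Illustration: why PL10 and PL11 collide under a phonetic filter.
// Metaphone-style encoders discard non-letter characters before encoding,
// so both postcodes reduce to the same key -- matching the injected "PL"
// term seen in the query string above.
public class PhoneticCollision {
    // Simplified stand-in for the encoder's letters-only cleaning step.
    static String phoneticKey(String token) {
        return token.replaceAll("[^A-Za-z]", "").toUpperCase();
    }

    public static void main(String[] args) {
        System.out.println(phoneticKey("PL10")); // PL
        System.out.println(phoneticKey("PL11")); // PL -- same token, so a
        // NOT query on PL10 also excludes records containing PL11
    }
}
```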

Please suggest how to achieve this, or which tokenizer is best suited for exact or partial matching without eliminating (or returning) such non-matching records.
Your help will be appreciated.

Neither the whitespace tokenizer nor the StandardTokenizer is going to help in your current situation, because both split words on whitespace.
You have two possible workarounds for your problem:
1: Write your own analyzer that splits the data on spaces, but searches on the basis of the first token: if it matches, return the result, otherwise don't.
This approach sounds good, but such an analyzer is very hard to write.
2: Store the data without any spaces and return results with a regex/wildcard query like PL10*.
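The second approach above can be sketched in plain Java (the helper names and in-memory search are illustrative stand-ins for the index-time normalisation and the wildcard query):

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch of approach 2: normalise postcodes at index time by stripping
// whitespace, then answer a "pl10*"-style search with a prefix test.
public class PostcodePrefixSearch {
    // Normalisation applied at index time: lowercase, remove all spaces.
    static String normalize(String postcode) {
        return postcode.toLowerCase().replaceAll("\\s+", "");
    }

    // Prefix match standing in for the wildcard query "pl10*".
    static List<String> search(List<String> records, String prefix) {
        String p = normalize(prefix);
        return records.stream()
                .filter(r -> normalize(r).startsWith(p))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> records = List.of("PL10 1AA", "PL11 1AA");
        System.out.println(search(records, "PL10")); // [PL10 1AA] only
    }
}
```

Because the stored value has no spaces, the prefix "pl10" matches "pl101aa" but not "pl111aa", which is exactly the exclusion behaviour asked for.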

Hi @Kumar_Shikhar

I think it’s time you take a step back and learn more about how analyzers and tokenizers work.

It’s the basis of full-text search with Lucene-based technologies, and if you don’t understand it, there’s no way you will be able to implement anything.

I agree with your advice; I don’t have much knowledge of Lucene, as I am a newbie to Hibernate Search and Lucene.
I am trying to learn and explore it as much as possible.
I just got caught by a requirement that my current knowledge of tokenizers and analyzers doesn’t cover.
Please let me know where I am going wrong in achieving the requirement.
Thanks!

Yes, I tried a lot with the whitespace/standard tokenizers but failed.
I have to achieve an exact match, meaning that if I search for PL10, I should get only the records that contain PL10.
The same applies when eliminating records. Users might also search for the whole postcode, and in that case I have to check for an exact match too.
I just wanted to know what needs to go into a custom analyzer.

The second approach does not seem to work. I ran a wildcard query after storing the records with the keyword tokenizer.
A regex query like PL10* is not bringing back any records.

Yes, because you have a space in between (if there is any space or special character in between, the regex won’t work).
Put this in your field;
obviously, change the name of the field.

I guess this configuration will work the same as the keyword tokenizer does.
Will changing the name of the field and setting analyze = Analyze.NO bring the results?

Change the name of the field according to your variable; the one above was mine.
And if you set Analyze.NO, it won’t tokenize the value while indexing.
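For reference, an un-analyzed field along these lines might look like the following (a sketch in Hibernate Search 5 annotation style; the field name fullRecordExact is just an example):

```java
// Extra un-analyzed field: the whole value is indexed as a single term,
// so exact and wildcard matches are not split on the space.
@Field(name = "fullRecordExact", index = Index.YES, analyze = Analyze.NO, store = Store.NO)
@Column(name = "FullRecord")
private String fullRecord;
```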
