Sorting bridge causes mass indexer speed degradation

My entities’ sortable fields only get their first token set in the index.
2019-07-08 13:20:21,070 [WARN ] org.hibernate.search.util.impl.InternalAnalyzerUtils - HSEARCH000321: The analysis of field 'someField' produced multiple tokens. Tokenization or term generation (synonyms) should not be used on sortable fields. Only the first token will be indexed.

Only being able to sort on the first token is problematic, so I’ve created a string sort bridge, to increase this from 1 token to 2 tokens:

package my.foo;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.util.BytesRef;
import org.hibernate.search.bridge.LuceneOptions;
import org.hibernate.search.bridge.MetadataProvidingFieldBridge;
import org.hibernate.search.bridge.TwoWayFieldBridge;
import org.hibernate.search.bridge.spi.FieldMetadataBuilder;
import org.hibernate.search.bridge.spi.FieldType;

public class StringSortBridge implements TwoWayFieldBridge, MetadataProvidingFieldBridge {
	public static final String SORT_SUFFIX = "_sort";
	
	@Override
	public void set(String name, Object value, Document document, LuceneOptions luceneOptions) {
		String valueAsString = value != null ? value + "" : Integer.MIN_VALUE + ""; // use null proxy for querying empty fields
		
		luceneOptions.addFieldToDocument(name, valueAsString, document);

		if (document.getField(name + SORT_SUFFIX) == null)
			document.add(new SortedDocValuesField(name + SORT_SUFFIX, new BytesRef(getWordsUpTo(valueAsString, 2))));
	}

	@Override
	public Object get(String name, Document document) {
		return document.get(name);
	}

	@Override
	public String objectToString(Object object) {
		return object.toString();
	}

	@Override
	public void configureFieldMetadata(String name, FieldMetadataBuilder builder) {
		builder
			.field(name, FieldType.STRING) // used in searches
				.sortable(false)
			.field(name + SORT_SUFFIX, FieldType.STRING) // used for sorting search results
				.sortable(true);
	}
	
	// use first N tokens for sorting
	public String getWordsUpTo(String value, int numberOfWords) {
		String[] tokens = value == null ? new String[] {} : value.split(" ");

		if (tokens.length <= numberOfWords)
			return value;
		
		value = "";
		
		int cur = 0;
		for (String token : tokens) {
			value += (token + " ");
			if ((++cur) == numberOfWords)
				break;
		}
		
		return value;
	}
}

I have changed my JPA entity string fields from:

	@Field(indexNullAs = Integer.MIN_VALUE + "") @SortableField
	@Column(name = "FOO")
	private String someField;

to

	@Field @FieldBridge(impl = StringSortBridge.class)
	@Column(name = "FOO")
	private String someField;

This sorting trick works, but the mass indexer’s indexing time has increased from 8 hours to 2+ days! The index sizes are only marginally larger, so I don’t understand how it can be so much slower.

Is there something obviously wrong in my above approach? I don’t see any errors in the logs. It’s just slower… for no obvious reasons.

You’re basically reimplementing analysis yourself, which is generally not a good idea, both for correctness and performance. It’s better to leave this to Lucene’s Analyzers, which are heavily optimized.

In the specific case of sortable fields, the solution is not to write analysis yourself, but to use another analyzer. And, since we don’t want tokenization in a sortable field, you will have to use a specific type of analyzer, called a normalizer: it’s basically an analyzer without a tokenizer.

In short, your field mapping will look like this:

	@Field
	@Field(name = "someField_sort", normalizer = @Normalizer(definition = "myNormalizer"), index = Index.NO)
	@SortableField(forField = "someField_sort")
	@Column(name = "FOO")
	private String someField;

As you can see, you declare two fields: one that uses your (default) analyzer and is indexed so you can query it, and the other that uses a specific normalizer and is declared as sortable.

You will simply have to define the normalizer, which works mostly like analyzer definitions, except you can’t assign a tokenizer.

See the section about normalizer in the documentation for more information.

Hi Yoann:
Thanks for the reply. I’ve seen your suggested solution in other forums (and in your documentation), but I was hoping to avoid it, since it contains a lot of boilerplate annotations that will bloat all our entities (which contain hundreds of fields each).
I’m also hesitant to create hardcoded string representations (e.g., “myField_sort”) for each unique field in our entities. It’s boilerplate on top of boilerplate.

Is there really no other efficient solution?

I mean, worst case, I’ll have to do as you suggest, but is there at least a way to avoid hardcoding fields (e.g. myField_sort) for every field in our entities?

None that I would use, no. At least not in Search 5. In Search 6 you’ll be able to assign a normalizer to use for each field from within the bridge, but in Search 5 you cannot do that.

Depending on the amount of time you’re willing to spend on this, you could define your own annotations, and implement a parser that defines a mapping programmatically based on your annotations. You will be mostly on your own for the parsing side of the problem, though: Hibernate Search provides nothing to facilitate that.

You could also try improving the efficiency of your bridge.

For your set method, I would avoid the call to document.getField, especially if you know you will only ever index single-valued properties. getField is not very efficient, especially on large documents, since it contains a loop on every field of the document.

For your getWordsUpTo method:

  • Use a StringTokenizer instead of String.split.
  • Concatenate the tokens using a StringBuilder instead of building a new string for each concatenation.
  • Return new BytesRef(stringBuilder) directly instead of creating an intermediary String that will then be converted to bytes.

But really that’s a shot in the dark. Efficient text processing is not as obvious as it might seem.
Also your current solution will not allow case-insensitive sorts, so… there’s that.
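
To make the getWordsUpTo suggestions concrete, here is a minimal sketch, assuming words are separated by plain spaces. The class and method names (SortKeys, firstWords) are hypothetical, and the helper returns the StringBuilder as a CharSequence so that, inside the bridge, it could be passed straight to new BytesRef(CharSequence). Note that StringTokenizer skips empty tokens, so runs of spaces collapse, which differs slightly from String.split(" ").

```java
import java.util.StringTokenizer;

// Hypothetical helper, not part of Hibernate Search or Lucene.
final class SortKeys {

    // Returns at most maxWords space-separated words of value, joined by single spaces.
    // A null value yields an empty result (an assumption; the original bridge uses a null proxy instead).
    static CharSequence firstWords(String value, int maxWords) {
        StringBuilder builder = new StringBuilder();
        if (value == null || maxWords <= 0) {
            return builder;
        }
        StringTokenizer tokenizer = new StringTokenizer(value, " ");
        int count = 0;
        while (count < maxWords && tokenizer.hasMoreTokens()) {
            if (count > 0) {
                builder.append(' ');
            }
            builder.append(tokenizer.nextToken());
            count++;
        }
        return builder;
    }
}
```

In the set method, the call would then look like new SortedDocValuesField(name + SORT_SUFFIX, new BytesRef(SortKeys.firstWords(valueAsString, 2))). Whether this is measurably faster would still need to be checked against the mass indexer.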

Also, before you spend more time on this, I would suggest testing whether the solution I suggested is faster or not… If there’s really a massive amount of text to put in doc values, it’s possible the slowdown is located in Lucene itself, not your bridge.

With my approach and with a normalizer, if there are really big chunks of text, you might want to use TruncateTokenFilterFactory to keep only the first N characters (LengthFilterFactory, by contrast, drops over-long tokens entirely instead of truncating them). You generally don’t need a million characters when you just want to sort.

By the way, if that’s what you intended to do with your getWordsUpTo method, I’d recommend doing that, too: just take the first N characters, don’t worry about the tokens. It might speed up the process.
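
As an illustration of the "first N characters" idea, a minimal sketch could look like the following. The names SortPrefix and firstChars are hypothetical, and returning an empty string for null is an assumption rather than what the original bridge does.

```java
// Hypothetical helper, not part of Hibernate Search or Lucene.
final class SortPrefix {

    // Returns at most maxChars leading chars of value, without tokenizing.
    static String firstChars(String value, int maxChars) {
        if (value == null || maxChars <= 0) {
            return "";
        }
        if (value.length() <= maxChars) {
            return value;
        }
        int end = maxChars;
        // Avoid cutting a surrogate pair in half, which would leave an invalid character.
        if (Character.isHighSurrogate(value.charAt(end - 1))) {
            end--;
        }
        return value.substring(0, end);
    }
}
```

A single substring call avoids both the split and the per-token concatenation, so it should be cheaper than any word-based variant.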

Hi Yoann:
Thanks for your help and additional suggestions. I will try some more to make my existing solution more efficient, using all your suggested optimizations.

Yes. The getWordsUpTo method is meant to limit the sorting values. I’ll take your suggestion and simply pass a small substring of the original field value, and not worry about passing the first N words.

If it still doesn’t work, worst case, I’ll fall back on your annotation-based solution (but, as you said, that solution may not work either, if the problem is in Lucene itself).

Thanks again.

Hi Yoann:
I overlooked that you provided another solution besides annotations: the programming API. That’s certainly the best solution to avoiding too many annotations. Seems like a lot of work up front, but that could be another solution (again, assuming that the problem is not with Lucene itself).
Thanks.
D

@yrodiere, so I gave up on trying to get my StringSortBridge class to work. Instead, I decided to use your annotation-based solution.

I’ve done this:

@NormalizerDef(name = "sort", filters = @TokenFilterDef(factory = LowerCaseFilterFactory.class))
public class MyEntity {
	...
	@Field
	@Field(name = "someField_sort", normalizer = @Normalizer(definition = "sort"))
	@SortableField(forField = "someField_sort")
	@Column(name = "FOO")
	private String someField;
	...
}

But, is there some way I can avoid having to define the normalizer on top of every entity, and instead put it into its own class? I can’t see any documentation that demonstrates this. Is it possible?

It’s a bit hidden, but as explained in 1.8. Analyzer:

analyzer definitions are global and their names must be unique

Same goes with normalizer definitions. You only need to define the normalizer on one entity, and it will be available for all entities.

If you want to avoid defining the normalizer on a particular entity, you have several options:

  • Put the @NormalizerDef annotation on a package, within a package-info.java file. The package must contain at least one @Indexed entity to be considered by Hibernate Search.
  • Create an analysis definition provider to define analyzers programmatically.
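
For the first option, a package-info.java file could look roughly like this. It is a sketch that reuses the "sort" normalizer from the question and assumes my.foo is a package containing at least one @Indexed entity.

```java
// File: my/foo/package-info.java (the package name is an assumption taken from the bridge example)
@NormalizerDef(name = "sort", filters = @TokenFilterDef(factory = LowerCaseFilterFactory.class))
package my.foo;

import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.hibernate.search.annotations.NormalizerDef;
import org.hibernate.search.annotations.TokenFilterDef;
```

Fields in any entity can then reference it with normalizer = @Normalizer(definition = "sort").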

Thanks @yrodiere.
If I define the normalizer, in its own class (by implementing LuceneAnalysisDefinitionProvider), what parameter would I need to put in the hibernate cfg file, to make that normalizer visible to my entities?

Also, would this normalizer be automatically inherited by all @Field annotations, or will I have the freedom to define the normalizer only on the fields I want? E.g.,
@Field(name = "someField_sort", normalizer = @Normalizer(definition = "sort"))

Not sure what you want to know; this is literally explained in the first few lines of the link I gave you. Or maybe I misunderstood your question?

The other way to define analyzers is programmatically. You can of course use the programmatic mapping API to do so, but an easier way may be to use the hibernate.search.lucene.analysis_definition_provider configuration property.

This property can be set to the fully-qualified name of a class with a public, no-arg constructor in your application. This class must either implement org.hibernate.search.analyzer.definition.LuceneAnalysisDefinitionProvider directly or expose a @Factory-annotated method that returns such an implementation.

No, the normalizer will not be automatically “inherited”. Actually the definition alone will not even affect a single entity: you need to reference it from @Field for the normalizer to be used.

Definition and assignment are two separate things, allowing you to pick a different normalizer for each field.

Hi @yrodiere. Sorry for not being clearer about what I mean.
I already use the hibernate.search.lucene.analysis_definition_provider param for another analyzer that I apply globally to all search fields:

package my.package;

import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.core.StopFilterFactory;
import org.apache.lucene.analysis.core.WhitespaceTokenizerFactory;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilterFactory;
import org.apache.lucene.analysis.standard.StandardTokenizerFactory;
import org.hibernate.search.analyzer.definition.LuceneAnalysisDefinitionProvider;
import org.hibernate.search.analyzer.definition.LuceneAnalysisDefinitionRegistryBuilder;

public class MySearchAnalyzer implements LuceneAnalysisDefinitionProvider {
    @Override
    public void register(LuceneAnalysisDefinitionRegistryBuilder builder) {
        builder
            .analyzer("mySearchAnalyzer")
                .tokenizer(StandardTokenizerFactory.class)
                .tokenFilter(LowerCaseFilterFactory.class)
                .tokenFilter(StopFilterFactory.class)
                    .param("words", "stoplist.properties")
                    .param("ignoreCase", "true")
                .tokenFilter(ASCIIFoldingFilterFactory.class)
        ;
    }
}

And defined in hibernate cfg using these props:

"hibernate.search.lucene.analysis_definition_provider", "my.package.MySearchAnalyzer"
"hibernate.search.analyzer", "mySearchAnalyzer"

Is it legal for me to build both my analyzer AND my normalizer in the same register method of my existing MySearchAnalyzer class?

Yes, absolutely. That’s how it’s intended to be used.
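
For illustration, a combined provider might look like the sketch below. It assumes a Hibernate Search 5 version that supports normalizers and that the registry builder exposes a normalizer(String) method alongside analyzer(String); the "sort" name is taken from the @NormalizerDef shown earlier in the thread. The package declaration is omitted here.

```java
import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.core.StopFilterFactory;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilterFactory;
import org.apache.lucene.analysis.standard.StandardTokenizerFactory;
import org.hibernate.search.analyzer.definition.LuceneAnalysisDefinitionProvider;
import org.hibernate.search.analyzer.definition.LuceneAnalysisDefinitionRegistryBuilder;

public class MySearchAnalyzer implements LuceneAnalysisDefinitionProvider {
    @Override
    public void register(LuceneAnalysisDefinitionRegistryBuilder builder) {
        // Existing analyzer, used for full-text search fields.
        builder.analyzer("mySearchAnalyzer")
                .tokenizer(StandardTokenizerFactory.class)
                .tokenFilter(LowerCaseFilterFactory.class)
                .tokenFilter(StopFilterFactory.class)
                    .param("words", "stoplist.properties")
                    .param("ignoreCase", "true")
                .tokenFilter(ASCIIFoldingFilterFactory.class);

        // Normalizer for sort fields: no tokenizer, only filters (assumed builder method).
        builder.normalizer("sort")
                .tokenFilter(LowerCaseFilterFactory.class);
    }
}
```

With this in place, the @NormalizerDef annotation on the entity would no longer be needed; definition names must be unique, so the same name should not be defined in both places.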