Sort by maximum similarity between vectors


I am currently trying to implement a dense passage retrieval system. Essentially, I want to perform a kNN search on entities that have multiple embeddings associated with them.

I want the entities in the search result to be sorted by the maximum embedding similarity.
My current implementation seems to sort the entities by the average embedding similarity.

Consider a scenario like this:

public class Book {
    private Integer id;

    // Each book owns several embeddings, indexed as nested documents.
    @OneToMany(mappedBy = "book", fetch = FetchType.EAGER)
    @IndexedEmbedded(structure = ObjectStructure.NESTED)
    private List<Embedding> bookEmbeddings;

    // Other properties ...
}

public class Embedding {
    private Integer id;
    private Book book;

    // 768-dimensional vector compared with cosine similarity.
    @VectorField(dimension = 768, vectorSimilarity = VectorSimilarity.COSINE, searchable = Searchable.YES)
    private float[] embedding;

    // Other properties ...
}
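float[] queryEmbedding = /*...*/;

// Assumes searchSession is a SearchSession, e.g. obtained via Search.session( entityManager ).
List<Book> hits = searchSession.search( Book.class )
    .where( f ->
        f.knn( 5 )
            .field( "bookEmbeddings.embedding" )
            .matching( queryEmbedding )
    ).fetchHits( 20 );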

In this example, one book has multiple embeddings associated with it.
When searching for books with a query embedding, I want the first result to be the book that has the single most similar embedding associated with it.
Currently, the average embedding similarity seems to be used for sorting.
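
To make the difference between the two orderings concrete, here is a minimal sketch in plain Java. It does not use the Hibernate Search API and does not reproduce Lucene's scoring; the class and method names (MaxSimilarityRanking, rankByMax, rankByAverage) are hypothetical and only illustrate the ranking I am after.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleFunction;

// Hypothetical helper, for illustration only: ranks books by how their
// embeddings compare to a query vector, using either the max or the average.
public class MaxSimilarityRanking {

    // Cosine similarity between two non-zero vectors of equal length.
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Similarity of the single best-matching embedding of one book.
    static double maxSimilarity(float[] query, List<float[]> embeddings) {
        return embeddings.stream()
                .mapToDouble(e -> cosine(query, e))
                .max()
                .orElse(Double.NEGATIVE_INFINITY);
    }

    // Mean similarity over all embeddings of one book.
    static double averageSimilarity(float[] query, List<float[]> embeddings) {
        return embeddings.stream()
                .mapToDouble(e -> cosine(query, e))
                .average()
                .orElse(Double.NEGATIVE_INFINITY);
    }

    // Sorts book ids by the given per-book score, highest first.
    static List<String> rank(Map<String, List<float[]>> books,
                             ToDoubleFunction<List<float[]>> score) {
        List<String> ids = new ArrayList<>(books.keySet());
        ids.sort(Comparator
                .comparingDouble((String id) -> score.applyAsDouble(books.get(id)))
                .reversed());
        return ids;
    }

    static List<String> rankByMax(Map<String, List<float[]>> books, float[] query) {
        return rank(books, embeddings -> maxSimilarity(query, embeddings));
    }

    static List<String> rankByAverage(Map<String, List<float[]>> books, float[] query) {
        return rank(books, embeddings -> averageSimilarity(query, embeddings));
    }
}
```

For example, with the query (1, 0), a book "A" with embeddings (1, 0) and (0, 1) has a maximum similarity of 1.0 but an average of 0.5, while a book "B" with two copies of (1, 1) has roughly 0.71 for both. Ranking by maximum puts A first, and ranking by average puts B first. I am looking for the former.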

Is it possible to achieve this behaviour with Hibernate Search?


I don’t think there is much you can do to influence how the knn predicate currently behaves. There is an ongoing discussion on the Lucene side about how to deal with multi-valued vector fields (Multi-value Support for KnnVectorField · Issue #12313 · apache/lucene · GitHub), and use cases similar to yours are mentioned there. Once Lucene has something implemented, we can explore exposing it through Hibernate Search. I can’t think of anything that would change the behavior with what’s currently available.