Sort by maximum similarity between vectors

felixs · May 29, 2024, 12:06pm

Hello,

I am currently trying to implement a dense passage retrieval system. Essentially I want to perform a knn search on Entities that have multiple Embeddings associated with them.

I want the entities in the search result to be sorted by the maximum embedding similarity.
My current implementation seems to sort the entities by the average embedding similarity.

Consider a scenario like this:

@Entity
@Indexed
public class Book {
        @Id
        private Integer id;
        
        @OneToMany(mappedBy = "book", fetch = FetchType.EAGER)
        @IndexedEmbedded(structure = ObjectStructure.NESTED)
        private List<Embedding> bookEmbeddings;

        // Other properties ...
}

@Entity
public class Embedding {
        @Id
        private Integer id;
        
        @ManyToOne
        private Book book;
    
        @VectorField(dimension = 768, vectorSimilarity = VectorSimilarity.COSINE, searchable = Searchable.YES)
        private float[] embedding;

        // Other properties ...
}

float[] queryEmbedding = /*...*/;

List<Book> hits = searchSession.search( Book.class )
        .where( f -> f.knn( 5 )
                .field( "bookEmbeddings.embedding" )
                .matching( queryEmbedding ) )
        .fetchHits( 20 );

In this example, one book has multiple embeddings associated with it.
When searching for books with a query embedding, I want the first result to be the book whose most similar embedding is closest to the query.
Currently, the average embedding similarity seems to be used for sorting.

Is it possible to achieve this behaviour with Hibernate Search?
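
To make the difference between the two orderings concrete, here is a minimal plain-Java sketch. It does not use the Hibernate Search API at all; the class name `MaxSimilarityRanking`, its methods, and the toy 2-dimensional vectors are hypothetical and only illustrate how ranking by maximum similarity can differ from ranking by average similarity.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleFunction;

// Hypothetical helper, not part of Hibernate Search: compares two ways of
// aggregating per-embedding cosine similarities into one score per book.
class MaxSimilarityRanking {

    // Cosine similarity between two vectors of equal length.
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Score of a book = similarity of its best-matching embedding.
    static double maxSimilarity(float[] query, List<float[]> embeddings) {
        double best = Double.NEGATIVE_INFINITY;
        for (float[] e : embeddings) {
            best = Math.max(best, cosine(query, e));
        }
        return best;
    }

    // Score of a book = mean similarity over all of its embeddings.
    static double averageSimilarity(float[] query, List<float[]> embeddings) {
        return embeddings.stream()
                .mapToDouble(e -> cosine(query, e))
                .average()
                .orElse(Double.NaN);
    }

    // Returns book ids sorted by descending score.
    static List<Integer> rank(Map<Integer, List<float[]>> books,
                              ToDoubleFunction<List<float[]>> score) {
        List<Integer> ids = new ArrayList<>(books.keySet());
        ids.sort((x, y) -> Double.compare(
                score.applyAsDouble(books.get(y)),
                score.applyAsDouble(books.get(x))));
        return ids;
    }

    public static void main(String[] args) {
        float[] query = {1f, 0f};
        Map<Integer, List<float[]>> books = new LinkedHashMap<>();
        // Book 1: one perfect match and one opposite vector.
        books.put(1, List.of(new float[] {1f, 0f}, new float[] {-1f, 0f}));
        // Book 2: two moderately similar vectors.
        books.put(2, List.of(new float[] {1f, 1f}, new float[] {1f, 1f}));

        System.out.println("by max:     " + rank(books, e -> maxSimilarity(query, e)));
        System.out.println("by average: " + rank(books, e -> averageSimilarity(query, e)));
    }
}
```

With these toy vectors, book 1 ranks first by maximum similarity (it contains an exact match), while book 2 ranks first by average similarity, which is the ordering I would like to avoid.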

mbekhta · May 29, 2024, 1:54pm

Hello!

I don’t think there is much you can do to influence how the knn predicate behaves currently. There is an ongoing discussion on the Lucene side on how to deal with multi-valued vector fields (Multi-value Support for KnnVectorField · Issue #12313 · apache/lucene · GitHub), and use cases similar to yours are mentioned there. Once Lucene has something implemented, we can explore exposing it through Hibernate Search. I can’t think of anything that would change the behavior with what’s currently available.
