Jotschi
Jotschi

Reputation: 3632

Howto run Nearest Neighbour Search with Lucene HnswGraph

I would like to use Lucene to run a nearest neighbour search. I'm using Lucene 9.0.0 on JVM 11. I did not find much documentation and mainly tried to piece things together using the existing tests.

I wrote a small test which prepares a HnswGraph but so far the search does not yield the expected result. I setup a set of random vectors and add a final vector (0.99f,0.01f) which is very close to my search target. The search unfortunately never returns the expected value. I'm not sure where my error is. I assume it may be related with the insert and document id order.

Maybe someone who is more familar with lucene might be able to provide some feedback. Is my approach correct? I'm using Documents only for persistence.

HnswGraphBuilder builder = new HnswGraphBuilder(vectors, similarityFunction, maxConn, beamWidth, seed);
HnswGraph hnsw = builder.build(vectors);

// Run a search
NeighborQueue nn = HnswGraph.search(
    new float[] { 1, 0 },
    10,
    10,
    vectors.randomAccess(), // ? Why do I need to specify the graph values again?
    similarityFunction, // ? Why can I specify a different similarityFunction for search. Should that not be the same that was used for graph creation?
    hnsw,
    null,
    new SplittableRandom(RandomUtils.nextLong()));

The whole test source can be found here: https://gist.github.com/Jotschi/cea21a72412bcba80c46b967e9c52b0f

Upvotes: 6

Views: 1718

Answers (5)

Ekrem Polat
Ekrem Polat

Reputation: 1

I was looking for an answer for Pylucene 10.0.0. Thanks to the Oliv's answer, I am able to implement the HnswGraphBuilder on Pylucene. I'm sharing my code here in case it helps others facing similar challenges since there are not many Pylucene examples around.

import lucene
from org.apache.lucene.util.hnsw import HnswGraph, HnswGraphSearcher, HnswGraphBuilder, RandomVectorScorerSupplier
from org.apache.lucene.index import VectorSimilarityFunction
from org.apache.lucene.codecs.hnsw import DefaultFlatVectorScorer
from org.apache.lucene.index import FloatVectorValues
from java.util import ArrayList

lucene.initVM(vmargs=['-Djava.awt.headless=true'])

sim_func = VectorSimilarityFunction.COSINE
max_conn = 16
beam_width = 100
seed = 423
vector_dim = len(vectors[0]) # vector dimension
    
default_scorer = DefaultFlatVectorScorer()
java_array = numpy_to_java_float_array(vectors, vector_dim)

vector_values = FloatVectorValues.fromFloats(java_array, vector_dim)
supplier = default_scorer.getRandomVectorScorerSupplier(sim_func, vector_values)

hnsw_graph_builder = HnswGraphBuilder.create(supplier, max_conn, beam_width, seed)
hnsw = hnsw_graph_builder.build(len(vectors))

Since Pylucene is a wrapper for Java, we need to convert numpy arrays to the Java List.

def numpy_to_java_float_array(numpy_array, vector_dim):
   """
    The method converts numpy array to the Java Array using Java API
    
    Arguments:
        - numpy_array (np.array): Numpy array that contains the embedding vectors
        - vector_dim (int): Vector dimension
    Returns:
        java_list (List): Embedding vectors as Java List
   """
    
   if numpy_array.dtype != np.float32:
      numpy_array = numpy_array.astype(np.float32)

   if numpy_array.ndim != 2 or numpy_array.shape[1] != vector_dim:
      raise ValueError(f"Input array must have shape (n, {vector_dim}).")

   java_list = ArrayList()

   # Populate the Java ArrayList with Java float arrays
   for row in numpy_array:
      java_float_array = lucene.JArray('float')(row.tolist())
      java_list.add(java_float_array)

   return java_list

Upvotes: 0

Oliv
Oliv

Reputation: 10822

This version works since Lucene 10.0. It uses the lower level APIs:

// Create a random vector universe
int vectorDimensions = 2;
int universeSize = 2_000;
List<float[]> universe = new ArrayList<>(universeSize);
for (var i = 0; i < universeSize; i++) {
    universe.add(randomVector(vectorDimensions));
}

// construct a HNSW graph of the universe
System.out.println("Constructing HNSW graph...");
DefaultFlatVectorScorer defaultFlatVectorScorer = new DefaultFlatVectorScorer();
FloatVectorValues vectorValues = FloatVectorValues.fromFloats(universe, vectorDimensions);
RandomVectorScorerSupplier scorerSupplier =
        defaultFlatVectorScorer.getRandomVectorScorerSupplier(EUCLIDEAN, vectorValues);
HnswGraphBuilder builder = HnswGraphBuilder.create(scorerSupplier, DEFAULT_BEAM_WIDTH,
        DEFAULT_MAX_CONN, ThreadLocalRandom.current().nextInt());
OnHeapHnswGraph hnsw = builder.build(vectorValues.size());

// search for the nearest neighbors of a random vector
float[] queryVector = randomVector(vectorDimensions);
System.out.println("Searching for top 10 neighbors of a random vector");
RandomVectorScorer queryScorer =
        defaultFlatVectorScorer.getRandomVectorScorer(EUCLIDEAN, vectorValues, queryVector);
KnnCollector nn = HnswGraphSearcher.search(queryScorer, 10, hnsw, null,
        Integer.MAX_VALUE);
for (ScoreDoc scoreDoc : nn.topDocs().scoreDocs) {
    System.out.printf("  ordinal %d (similarity: %s)%n", scoreDoc.doc, scoreDoc.score);
}

Upvotes: 0

jbellis
jbellis

Reputation: 19377

Since this is a top search hit for people like me looking for lucene hnsw examples (there just aren't many out there), here's what this looks like as of Lucene 9.6, without bringing in higher-level Lucene classes:

// Create a random vector universe
var vectorDimensions = 1500;
var universeSize = 2_000;
var universe = new ArrayList<float[]>(universeSize);
for (var i = 0; i < universeSize; i++) {
    universe.add(randomVector(vectorDimensions));
}

// construct a HNSW graph of the universe
System.out.println("Constructing HNSW graph...");
var ravv = new ListRandomAccessVectorValues(universe, vectorDimensions);
var builder = HnswGraphBuilder.create(ravv, VectorEncoding.FLOAT32, similarityFunction, 16, 100, random.nextInt());
var hnsw = builder.build(ravv.copy());

// search for the nearest neighbors of a random vector
var queryVector = randomVector(vectorDimensions);
System.out.println("Searching for top 10 neighbors of a random vector");
var nn = HnswGraphSearcher.search(queryVector, 10, ravv.copy(), VectorEncoding.FLOAT32, similarityFunction, hnsw, null, Integer.MAX_VALUE);
for (var i : nn.nodes()) {
    var neighbor = universe.get(i);
    var similarity = similarityFunction.compare(queryVector, neighbor);
    System.out.printf("  ordinal %d (similarity: %s)%n", i, similarity);
}

The implementation of ListRandomAccessVectorValues is straightforward and can be found here: https://github.com/jbellis/hnswdemo

As you note, the API is a bit wonky, and while you can specify different encodings or similarities to the search than from the builder, you will obviously get nonsense results.

The reason you have to pass the vector values in to the search function is that for efficiency, the index itself does not store a copy of the vectors themselves, but only an int offset into the values provider.

Upvotes: 2

Jotschi
Jotschi

Reputation: 3632

Another way to solve this is to use the KnnVectorQuery.

try (IndexReader reader = DirectoryReader.open(dir)) {
    IndexSearcher searcher = new IndexSearcher(reader);
    System.out.println("Query: [" + String.format("%.2f", queryVector[0]) + ", " + String.format("%.2f", queryVector[1]) + "]");
    TopDocs results = searcher.search(new KnnVectorQuery("field", queryVector, 3), 10);
    System.out.println("Hits: " + results.totalHits);
    for (ScoreDoc sdoc : results.scoreDocs) {
        Document doc = reader.document(sdoc.doc);
        StoredField idField = (StoredField) doc.getField("id");
        System.out.println("Found: " + idField.numericValue() + " = " + String.format("%.1f", sdoc.score));
    }
}

Full example: https://gist.github.com/Jotschi/7d599dff331d75a3bdd02e62f65abfba

Upvotes: 1

Jotschi
Jotschi

Reputation: 3632

I managed to get this working.

Instead of using the HnswGraph API directly I now use LeafReader#searchNearestVectors. While debugging I noticed that the Lucene90HnswVectorsWriter for example invokes extra steps using the HnswGraph API. I assume this is done to create a correlation between inserted vectors and document Ids. The nodeIds I retrieved using a HnswGraph#search never matched up with the matched up with the vector Ids. I don't know whether extra steps are needed to setup the graph or whether the correlation needs to be created afterwards somehow.

The good news is that the LeafReader#searchNearestVectors method works. I have updated the example which now also makes use of the Lucene documents.

@Test
public void testWriteAndQueryIndex() throws IOException {
    // Persist and read the data
    try (MMapDirectory dir = new MMapDirectory(indexPath)) {
        // Write index
        int indexedDoc = writeIndex(dir, vectors);
        // Read index
        readAndQuery(dir, vectors, indexedDoc);
    }
}

Vector 7 with [0.97|0.02] is very close to the search query target [0.98|0.01].

Test vectors:
0 => [0.13|0.37]
1 => [0.99|0.49]
2 => [0.98|0.57]
3 => [0.23|0.64]
4 => [0.72|0.92]
5 => [0.08|0.74]
6 => [0.50|0.27]
7 => [0.97|0.02]
8 => [0.90|0.21]
9 => [0.89|0.09]
10 => [0.11|0.95]

Doc Based Search:
Searching for NN of [0.98 | 0.01]
TotalHits: 11
7 => [0.97|0.02]
9 => [0.89|0.09]

Full example: https://gist.github.com/Jotschi/d8a91758c84203d172f818c8be4964e4

Upvotes: 1

Related Questions