Reputation: 3632
I would like to use Lucene to run a nearest neighbour search. I'm using Lucene 9.0.0 on JVM 11. I did not find much documentation and mainly tried to piece things together using the existing tests.
I wrote a small test which prepares a HnswGraph, but so far the search does not yield the expected result. I set up a set of random vectors and add a final vector (0.99f, 0.01f) which is very close to my search target.
The search unfortunately never returns the expected value. I'm not sure where my error is. I assume it may be related to the insert and document id order.
Maybe someone who is more familiar with Lucene might be able to provide some feedback. Is my approach correct? I'm using Documents only for persistence.
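For context, the vector setup is roughly this (a simplified sketch of the test data, not the exact gist code):
// Simplified sketch: a handful of random 2d vectors plus one vector close to the query target
List<float[]> vectorData = new ArrayList<>();
Random random = new Random();
for (int i = 0; i < 10; i++) {
    vectorData.add(new float[] { random.nextFloat(), random.nextFloat() });
}
vectorData.add(new float[] { 0.99f, 0.01f }); // expected nearest neighbour of the query (1, 0)
The graph is then built and searched like this: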
HnswGraphBuilder builder = new HnswGraphBuilder(vectors, similarityFunction, maxConn, beamWidth, seed);
HnswGraph hnsw = builder.build(vectors);
// Run a search
NeighborQueue nn = HnswGraph.search(
    new float[] { 1, 0 },
    10,
    10,
    vectors.randomAccess(), // ? Why do I need to specify the graph values again?
    similarityFunction, // ? Why can I specify a different similarityFunction for search? Should that not be the same that was used for graph creation?
    hnsw,
    null,
    new SplittableRandom(RandomUtils.nextLong()));
The whole test source can be found here: https://gist.github.com/Jotschi/cea21a72412bcba80c46b967e9c52b0f
Upvotes: 6
Views: 1718
Reputation: 1
I was looking for an answer for PyLucene 10.0.0. Thanks to Oliv's answer, I was able to implement the HnswGraphBuilder in PyLucene. I'm sharing my code here in case it helps others facing similar challenges, since there are not many PyLucene examples around.
import lucene
import numpy as np
from org.apache.lucene.util.hnsw import HnswGraph, HnswGraphSearcher, HnswGraphBuilder, RandomVectorScorerSupplier
from org.apache.lucene.index import VectorSimilarityFunction, FloatVectorValues
from org.apache.lucene.codecs.hnsw import DefaultFlatVectorScorer
from java.util import ArrayList
lucene.initVM(vmargs=['-Djava.awt.headless=true'])
sim_func = VectorSimilarityFunction.COSINE
max_conn = 16
beam_width = 100
seed = 423
# `vectors` is a numpy array of shape (n, vector_dim) prepared beforehand
vector_dim = len(vectors[0])  # vector dimension
default_scorer = DefaultFlatVectorScorer()
java_array = numpy_to_java_float_array(vectors, vector_dim)
vector_values = FloatVectorValues.fromFloats(java_array, vector_dim)
supplier = default_scorer.getRandomVectorScorerSupplier(sim_func, vector_values)
hnsw_graph_builder = HnswGraphBuilder.create(supplier, max_conn, beam_width, seed)
hnsw = hnsw_graph_builder.build(len(vectors))
Since PyLucene is a wrapper for Java, we need to convert the numpy arrays into a Java list of Java float arrays.
def numpy_to_java_float_array(numpy_array, vector_dim):
    """
    Converts a numpy array to a Java array list using the Java API.
    Arguments:
    - numpy_array (np.array): Numpy array that contains the embedding vectors
    - vector_dim (int): Vector dimension
    Returns:
    - java_list (ArrayList): Embedding vectors as a Java list of float arrays
    """
    if numpy_array.dtype != np.float32:
        numpy_array = numpy_array.astype(np.float32)
    if numpy_array.ndim != 2 or numpy_array.shape[1] != vector_dim:
        raise ValueError(f"Input array must have shape (n, {vector_dim}).")
    java_list = ArrayList()
    # Populate the Java ArrayList with Java float arrays
    for row in numpy_array:
        java_float_array = lucene.JArray('float')(row.tolist())
        java_list.add(java_float_array)
    return java_list
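The snippet above only builds the graph. A search step, mirroring the Lucene 10 Java answer here, might look roughly like this (an unverified sketch: query_vector stands for your query vector, and None/2147483647 are meant as Java null and Integer.MAX_VALUE):
# Unverified sketch of the search step (assumes the same HnswGraphSearcher.search
# overload used in the Java answer: scorer, topK, graph, acceptOrds, visitedLimit)
query = np.asarray(query_vector, dtype=np.float32)
query_java = lucene.JArray('float')(query.tolist())
query_scorer = default_scorer.getRandomVectorScorer(sim_func, vector_values, query_java)
collector = HnswGraphSearcher.search(query_scorer, 10, hnsw, None, 2147483647)
for score_doc in collector.topDocs().scoreDocs:
    print(score_doc.doc, score_doc.score)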
Upvotes: 0
Reputation: 10822
This version works as of Lucene 10.0 and uses the lower-level APIs:
// Create a random vector universe
int vectorDimensions = 2;
int universeSize = 2_000;
List<float[]> universe = new ArrayList<>(universeSize);
for (var i = 0; i < universeSize; i++) {
    universe.add(randomVector(vectorDimensions));
}
// construct an HNSW graph of the universe
System.out.println("Constructing HNSW graph...");
DefaultFlatVectorScorer defaultFlatVectorScorer = new DefaultFlatVectorScorer();
FloatVectorValues vectorValues = FloatVectorValues.fromFloats(universe, vectorDimensions);
RandomVectorScorerSupplier scorerSupplier =
    defaultFlatVectorScorer.getRandomVectorScorerSupplier(EUCLIDEAN, vectorValues);
// create(scorerSupplier, M, beamWidth, seed)
HnswGraphBuilder builder = HnswGraphBuilder.create(scorerSupplier, DEFAULT_MAX_CONN,
    DEFAULT_BEAM_WIDTH, ThreadLocalRandom.current().nextInt());
OnHeapHnswGraph hnsw = builder.build(vectorValues.size());
// search for the nearest neighbors of a random vector
float[] queryVector = randomVector(vectorDimensions);
System.out.println("Searching for top 10 neighbors of a random vector");
RandomVectorScorer queryScorer =
    defaultFlatVectorScorer.getRandomVectorScorer(EUCLIDEAN, vectorValues, queryVector);
KnnCollector nn = HnswGraphSearcher.search(queryScorer, 10, hnsw, null, Integer.MAX_VALUE);
for (ScoreDoc scoreDoc : nn.topDocs().scoreDocs) {
    System.out.printf("  ordinal %d (similarity: %s)%n", scoreDoc.doc, scoreDoc.score);
}
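The randomVector helper is not part of Lucene; a minimal version might look like this:
// Hypothetical helper (not a Lucene API): fill a float[] with random values
private static float[] randomVector(int dimensions) {
    float[] vector = new float[dimensions];
    for (int i = 0; i < dimensions; i++) {
        vector[i] = ThreadLocalRandom.current().nextFloat();
    }
    return vector;
}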
Upvotes: 0
Reputation: 19377
Since this is a top search hit for people like me looking for Lucene HNSW examples (there just aren't many out there), here's what this looks like as of Lucene 9.6, without bringing in higher-level Lucene classes:
// Create a random vector universe
var vectorDimensions = 1500;
var universeSize = 2_000;
var universe = new ArrayList<float[]>(universeSize);
for (var i = 0; i < universeSize; i++) {
    universe.add(randomVector(vectorDimensions));
}
// construct a HNSW graph of the universe
System.out.println("Constructing HNSW graph...");
var ravv = new ListRandomAccessVectorValues(universe, vectorDimensions);
var builder = HnswGraphBuilder.create(ravv, VectorEncoding.FLOAT32, similarityFunction, 16, 100, random.nextInt());
var hnsw = builder.build(ravv.copy());
// search for the nearest neighbors of a random vector
var queryVector = randomVector(vectorDimensions);
System.out.println("Searching for top 10 neighbors of a random vector");
var nn = HnswGraphSearcher.search(queryVector, 10, ravv.copy(), VectorEncoding.FLOAT32, similarityFunction, hnsw, null, Integer.MAX_VALUE);
for (var i : nn.nodes()) {
    var neighbor = universe.get(i);
    var similarity = similarityFunction.compare(queryVector, neighbor);
    System.out.printf("  ordinal %d (similarity: %s)%n", i, similarity);
}
The implementation of ListRandomAccessVectorValues is straightforward and can be found here: https://github.com/jbellis/hnswdemo
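For reference, a minimal version could look roughly like this (a sketch assuming the Lucene 9.6 RandomAccessVectorValues<float[]> interface; the repo's actual implementation may differ):
import java.util.List;
import org.apache.lucene.util.hnsw.RandomAccessVectorValues;

// Sketch: serves vectors straight out of an in-memory list, keyed by ordinal
public class ListRandomAccessVectorValues implements RandomAccessVectorValues<float[]> {
    private final List<float[]> vectors;
    private final int dimension;

    public ListRandomAccessVectorValues(List<float[]> vectors, int dimension) {
        this.vectors = vectors;
        this.dimension = dimension;
    }

    @Override
    public int size() {
        return vectors.size();
    }

    @Override
    public int dimension() {
        return dimension;
    }

    @Override
    public float[] vectorValue(int targetOrd) {
        return vectors.get(targetOrd);
    }

    @Override
    public ListRandomAccessVectorValues copy() {
        // The backing list is treated as read-only here, so sharing it is fine
        return this;
    }
}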
As you note, the API is a bit wonky: you can pass a different encoding or similarity function to the search than you used when building the graph, but you will obviously get nonsense results if you do.
The reason you have to pass the vector values into the search function is that, for efficiency, the index itself does not store a copy of the vectors; it only stores int offsets into the values provider.
Upvotes: 2
Reputation: 3632
Another way to solve this is to use the KnnVectorQuery.
try (IndexReader reader = DirectoryReader.open(dir)) {
    IndexSearcher searcher = new IndexSearcher(reader);
    System.out.println("Query: [" + String.format("%.2f", queryVector[0]) + ", " + String.format("%.2f", queryVector[1]) + "]");
    TopDocs results = searcher.search(new KnnVectorQuery("field", queryVector, 3), 10);
    System.out.println("Hits: " + results.totalHits);
    for (ScoreDoc sdoc : results.scoreDocs) {
        Document doc = reader.document(sdoc.doc);
        StoredField idField = (StoredField) doc.getField("id");
        System.out.println("Found: " + idField.numericValue() + " = " + String.format("%.1f", sdoc.score));
    }
}
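For context, the write side that produces this index uses a KnnVectorField; it could look roughly like this (a sketch: the field names mirror the query above, and the similarity function is an assumption, not necessarily what the gist uses):
// Sketch of the indexing side ("field" and "id" match the query snippet above;
// the similarity function here is an assumption)
try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
    for (int i = 0; i < vectors.length; i++) {
        Document doc = new Document();
        doc.add(new KnnVectorField("field", vectors[i], VectorSimilarityFunction.DOT_PRODUCT));
        doc.add(new StoredField("id", i));
        writer.addDocument(doc);
    }
}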
Full example: https://gist.github.com/Jotschi/7d599dff331d75a3bdd02e62f65abfba
Upvotes: 1
Reputation: 3632
I managed to get this working.
Instead of using the HnswGraph API directly I now use LeafReader#searchNearestVectors. While debugging I noticed that the Lucene90HnswVectorsWriter, for example, invokes extra steps using the HnswGraph API. I assume this is done to create a correlation between the inserted vectors and the document ids. The node ids I retrieved via HnswGraph#search never matched up with the vector ids. I don't know whether extra steps are needed to set up the graph or whether the correlation needs to be created afterwards somehow.
The good news is that the LeafReader#searchNearestVectors method works. I have updated the example, which now also makes use of Lucene documents.
@Test
public void testWriteAndQueryIndex() throws IOException {
    // Persist and read the data
    try (MMapDirectory dir = new MMapDirectory(indexPath)) {
        // Write index
        int indexedDoc = writeIndex(dir, vectors);
        // Read index
        readAndQuery(dir, vectors, indexedDoc);
    }
}
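Stripped down, the readAndQuery step boils down to something like this (a sketch rather than the exact gist code; the field name is a placeholder, and note that searchNearestVectors gained a visitedLimit argument in later 9.x releases):
// Sketch of the query side; "vector" is a placeholder field name
private void readAndQuery(MMapDirectory dir, float[][] vectors, int indexedDoc) throws IOException {
    try (IndexReader reader = DirectoryReader.open(dir)) {
        float[] queryVector = new float[] { 0.98f, 0.01f };
        for (LeafReaderContext ctx : reader.leaves()) {
            // k = 2 nearest vectors; null = no accept-docs filter (Lucene 9.0 signature)
            TopDocs results = ctx.reader().searchNearestVectors("vector", queryVector, 2, null);
            for (ScoreDoc scoreDoc : results.scoreDocs) {
                System.out.println(scoreDoc.doc + " => score " + scoreDoc.score);
            }
        }
    }
}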
Vector 7 with [0.97|0.02] is very close to the search query target [0.98|0.01].
Test vectors:
0 => [0.13|0.37]
1 => [0.99|0.49]
2 => [0.98|0.57]
3 => [0.23|0.64]
4 => [0.72|0.92]
5 => [0.08|0.74]
6 => [0.50|0.27]
7 => [0.97|0.02]
8 => [0.90|0.21]
9 => [0.89|0.09]
10 => [0.11|0.95]
Doc Based Search:
Searching for NN of [0.98 | 0.01]
TotalHits: 11
7 => [0.97|0.02]
9 => [0.89|0.09]
Full example: https://gist.github.com/Jotschi/d8a91758c84203d172f818c8be4964e4
Upvotes: 1