Reputation: 435
Is there an implementation of vector similarity search that works in iOS?
I have a set of ~10K+ vectors. When I get a new vector, I'd like to find the top K vectors from the set that are most similar to the new vector.
This can be done either in a loop: compute similarity between every vector in the set and the new vector, and pick the top K.
There are more efficient ways (e.g. FAISS or ScanNN) that can scale to millions of vectors, but they don't work on iOS. Is there an implementation of vector similarity search that works in iOS?
Edit 1: I've found this project https://github.com/hora-search/hora-ios, but it looks dead. Has anyone tried it out?
Edit 2: On ~10K vectors, brute forcing works... The question remains open for cases with much more vectors
Upvotes: 2
Views: 1264
Reputation: 11
Depending on your constraints, a vector database like Pinecone could give you what you need.
Upvotes: 0
Reputation: 531
For me, if the size of vectors to be queried is about 10k+, I'd like to use the brute force query method for it. Actually, in iOS accelerated framework, there are cblas_ functions(maybe currently marked as deprecated in the API reference). As the FlatIndex in faiss actually does Gemm and then TopK selection.
I suggest it could be started with cblas_sgemm + heap sort(topk) for a start. In my point of view, on the mobile platform, the power-efficient may be more important, so prefer to call native API from vendors.
// return random vector with L2 norm
func randFloatArray(dims: Int, count: Int) -> [Float]
{
var values = [Float](repeating: 0.0, count: dims * count)
for n in 0...count-1
{
var sum: Float = 0.0
let offset = n * dims
for index in 0...dims-1
{
let val = Float(drand48());
sum += val * val
values[offset+index] = val
}
let sum_sqrt = sqrt(sum)
for index in 0...dims-1
{
values[offset+index] /= sum_sqrt
}
}
return values
}
class Vectors {
struct QueryResult {
var v: Float
var idx: Int64
}
var data: [Float]
var ids: [Int64]
var dims: Int32
init (dims: Int) {
self.data = [Float]()
self.ids = [Int64]()
self.dims = Int32(dims)
}
// heap sort for select top k
func heapSort(_ array: [QueryResult], topk: Int) -> [QueryResult] {
var heap = Array(array.prefix(topk))
func siftDown(_ start: Int, _ end: Int) {
var root = start
while root * 2 + 1 <= end {
let child = root * 2 + 1
var swap = root
if heap[swap].v > heap[child].v {
swap = child
}
if child + 1 <= end && heap[swap].v > heap[child + 1].v {
swap = child + 1
}
if swap == root {
return
} else {
heap.swapAt(root, swap)
root = swap
}
}
}
let count = heap.count
for i in stride(from: count / 2 - 1, through: 0, by: -1) {
siftDown(i, count - 1)
}
for i in stride(from: count, to: array.count, by: 1) {
if array[i].v > heap[0].v {
heap[0] = array[i]
siftDown(0, count - 1)
}
}
return heap.sorted(by:{$0.v > $1.v})
}
func insert(vec: [Float], ids: [Int64])
{
self.data.append(contentsOf: vec)
self.ids.append(contentsOf: ids)
}
func count() -> Int
{
return data.count / Int(dims)
}
// query, return (results, gemm time, k select time)
func query(vec: [Float], topk: Int) -> ([QueryResult], UInt64, UInt64)
{
var c = [Float](repeating: 0.0, count: Int(self.count()))
var results = [QueryResult](repeating: QueryResult(v: 0.0, idx: -1), count: Int(self.count()))
let t0 = UInt64(Date().timeIntervalSince1970 * 1000)
vec.withUnsafeBufferPointer { vec in
data.withUnsafeBufferPointer {data in
c.withUnsafeMutableBufferPointer {c in
cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans, 1, Int32(self.count()), dims, 1.0, vec.baseAddress, dims, data.baseAddress, dims, 0.0, c.baseAddress, Int32(self.count()))
}
}
}
let t1 = UInt64(Date().timeIntervalSince1970 * 1000)
// prepare result before sort
for index in 0...self.count()-1 {
results[index] = QueryResult(v:c[index], idx: self.ids[index])
}
// sort for poc, using heap sort for tokk is more efficient
results = heapSort(results, topk: topk)
let t2 = UInt64(Date().timeIntervalSince1970 * 1000)
return (results, t1-t0, t2-t1)
}
}
with 256 dimensions, input 1 vector query top 3 in 100K records.
Upvotes: 0