Reputation: 537
I'm working on a project where I need to find the nearest neighbours of an embedding vector. Recently I've been trying to use Google's new ANN tool, ScaNN (github). I was able to create the searcher object and serialize it for a small dataset (~200K rows with 512 values) with the following code:
import numpy as np
import scann

# ~200K rows of 512-dim vectors, L2-normalized so "dot_product" behaves like cosine similarity
data = np.random.random((200000, 512))
data = data / np.linalg.norm(data, axis=1)[:, np.newaxis]

searcher = scann.scann_ops_pybind.builder(data, 10, "dot_product").tree(
    num_leaves=2000, num_leaves_to_search=100, training_sample_size=250000
).score_ah(
    2, anisotropic_quantization_threshold=0.2
).reorder(100).build()

searcher.serialize('./scann')
But when I tried it with the real dataset (~48M rows with 512 values), I got:
In [11]: searcher.serialize('scann/')
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-11-71a5ef71c81f> in <module>
----> 1 searcher.serialize('scann/')
~/.local/lib/python3.6/site-packages/scann/scann_ops/py/scann_ops_pybind.py in serialize(self, artifacts_dir)
70
71 def serialize(self, artifacts_dir):
---> 72 self.searcher.serialize(artifacts_dir)
73
74
MemoryError: std::bad_alloc
The size of the .npy file for the dataset is ~90 GB, and I have at least 500 GB of free RAM and 1 TB of free disk space.
I'm running Ubuntu 18.04.5 LTS and Python 3.6.9. The ScaNN module was installed with pip.
Any ideas what might be going on?
Thanks for the help!
[edit] After @MSalters' comment, I did some testing and found out that if the dataset to be serialized is larger than 16777220 bytes (2^24 + 4), it fails with the bad_alloc message. I still don't know why this happens...
[edit2] I built ScaNN from source and added some debug prints. The error seems to come from this line:
vector<uint8_t> storage(hash_dim * expected_size);
and if I add some prints around it like this:
std::cout << hash_dim << " " << expected_size << "\n" << std::flush;
std::cout << hash_dim * expected_size << "\n" << std::flush;
vector<uint8_t> v2;
std::cout << v2.max_size() << "\n" << std::flush;
vector<uint8_t> storage(hash_dim * expected_size);
std::cout << "after storage creation\n" << std::flush;
Then I get:
256 8388608
-2147483648
9223372036854775807
Upvotes: 3
Views: 353
Reputation: 85276
There seems to be an existing issue report in ScaNN, #427, with a similar error.
Based on the output of -2147483648 for std::cout << hash_dim * expected_size, we can conclude that hash_dim * expected_size overflows: with the values from your debug prints, 256 * 8388608 = 2147483648 = 2^31, which is one more than INT_MAX, so the 32-bit signed multiplication wraps around to -2147483648.
Looking at the source, we see that the type of both hash_dim and expected_size is int. So the type of at least one of these should probably have been int64_t, long long or, better yet, size_t.
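For illustration, here is a small standalone sketch (not ScaNN's actual code) that reproduces the overflow with the values from your debug output and shows how promoting one operand to a size type avoids it:

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
  int hash_dim = 256;           // values taken from the debug output above
  int expected_size = 8388608;  // 2^23

  // int * int is evaluated in 32-bit signed arithmetic; 256 * 2^23 == 2^31
  // exceeds INT_MAX, so on typical platforms the product wraps to -2147483648.
  std::cout << hash_dim * expected_size << "\n";

  // Casting one operand to size_t first keeps the full 64-bit value.
  std::size_t n = static_cast<std::size_t>(hash_dim) * expected_size;
  std::cout << n << "\n";  // 2147483648

  // Sized with the correct value, the vector allocates the intended ~2 GiB
  // buffer instead of requesting an absurd amount and throwing std::bad_alloc.
  std::vector<std::uint8_t> storage(n);
  std::cout << storage.size() << "\n";
}

size_t is also a natural choice here because it is exactly the type std::vector expects for its size.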
Looking further at the source of ScaNN, it seems there might be more places that could benefit from a data type specifically designed to hold a size (size_t) instead of an int.
Upvotes: 1