Reputation: 537
I'm working on a project where I need to find the nearest neighbours of an embedding vector. Recently I've been trying to use Google's new ANN tool, ScaNN (github). I was able to create the searcher object and serialize it for a small dataset (~200K rows with 512 values) with the following code:
import numpy as np
import scann

# ~200K rows of 512-dim vectors, L2-normalized so "dot_product" behaves like cosine similarity
data = np.random.random((200000, 512))
data = data / np.linalg.norm(data, axis=1)[:, np.newaxis]

searcher = scann.scann_ops_pybind.builder(data, 10, "dot_product").tree(
    num_leaves=2000, num_leaves_to_search=100, training_sample_size=250000
).score_ah(
    2, anisotropic_quantization_threshold=0.2
).reorder(100).build()

searcher.serialize('./scann')
But when I tried it with the real dataset (~48M rows with 512 values), I got:
In [11]: searcher.serialize('scann/')
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-11-71a5ef71c81f> in <module>
----> 1 searcher.serialize('scann/')
~/.local/lib/python3.6/site-packages/scann/scann_ops/py/scann_ops_pybind.py in serialize(self, artifacts_dir)
70
71 def serialize(self, artifacts_dir):
---> 72 self.searcher.serialize(artifacts_dir)
73
74
MemoryError: std::bad_alloc
The size of the .npy file for the dataset is ~90 GB, and I have at least 500 GB of free RAM and 1 TB of free disk space.
I'm running Ubuntu 18.04.5 LTS and Python 3.6.9. The ScaNN module was installed with pip.
Any ideas what might be going on?
Thanks for the help!
[edit] After @MSalters' comment, I did some testing and found out that if the dataset to be serialized is larger than 16777220 bytes (2^24 + 4), it fails with the bad_alloc message. I still don't know why this happens...
[edit2] I built ScaNN from source and added some debug prints. The error seems to come from this line:
vector<uint8_t> storage(hash_dim * expected_size);
and if I add some prints around it like this:
std::cout << hash_dim << " " << expected_size << "\n" << std::flush;
std::cout << hash_dim * expected_size << "\n" << std::flush;
vector<uint8_t> v2;
std::cout << v2.max_size() << "\n" << std::flush;
vector<uint8_t> storage(hash_dim * expected_size);
std::cout << "after storage creation\n" << std::flush;
Then I get:
256 8388608
-2147483648
9223372036854775807
Upvotes: 3
Views: 353
Reputation: 85276
There seems to be an existing issue report in ScaNN, #427, with a similar error.
Based on the output of -2147483648 for std::cout << hash_dim * expected_size, we can conclude that hash_dim * expected_size overflows: with the values from your debug prints, 256 * 8388608 = 2147483648 = 2^31, which is one more than INT_MAX, so the 32-bit signed multiplication wraps around to -2147483648.
Looking at the source, we see that the type of both hash_dim and expected_size is int. So the type of at least one of these should probably have been int64_t, long long or, better yet, size_t.
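For illustration, here is a small standalone sketch (not ScaNN's actual code) that reproduces the overflow with the values from your debug output and shows how promoting one operand to a size type avoids it:

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
  int hash_dim = 256;           // values taken from the debug output above
  int expected_size = 8388608;  // 2^23

  // int * int is evaluated in 32-bit signed arithmetic; 256 * 2^23 == 2^31
  // exceeds INT_MAX, so on typical platforms the product wraps to -2147483648.
  std::cout << hash_dim * expected_size << "\n";

  // Casting one operand to size_t first keeps the full 64-bit value.
  std::size_t n = static_cast<std::size_t>(hash_dim) * expected_size;
  std::cout << n << "\n";  // 2147483648

  // Sized with the correct value, the vector allocates the intended ~2 GiB
  // buffer instead of requesting an absurd amount and throwing std::bad_alloc.
  std::vector<std::uint8_t> storage(n);
  std::cout << storage.size() << "\n";
}

size_t is also a natural choice here because it is exactly the type std::vector expects for its size.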
Looking further at the source of ScaNN, it seems there might be more places that could benefit from a data type specifically designed to hold a size (size_t) instead of an int.
Upvotes: 1