Feulo

Reputation: 537

C++ out of memory in Python, plenty of space left

I'm working on a project where I need to find the Nearest Neighbors of an embedding vector. Recently, I've been trying to use Google's new ANN tool ScaNN (GitHub). I was able to create the searcher object and serialize it for a small dataset (~200K rows with 512 values) with the following code:

import numpy as np
import scann
data = np.random.random((200000, 512))
data = data / np.linalg.norm(data, axis=1)[:, np.newaxis]
searcher = scann.scann_ops_pybind.builder(data, 10, "dot_product").tree(
    num_leaves=2000, num_leaves_to_search=100, training_sample_size=250000).score_ah(
    2, anisotropic_quantization_threshold=0.2).reorder(100).build()
searcher.serialize('./scann')

But when I tried it with the real dataset (~48M rows with 512 values), I got:

In [11]: searcher.serialize('scann/')
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-11-71a5ef71c81f> in <module>
----> 1 searcher.serialize('scann/')

~/.local/lib/python3.6/site-packages/scann/scann_ops/py/scann_ops_pybind.py in serialize(self, artifacts_dir)
     70
     71   def serialize(self, artifacts_dir):
---> 72     self.searcher.serialize(artifacts_dir)
     73
     74

MemoryError: std::bad_alloc

The size of the .npy file for the dataset is ~90 GB, and I have at least 500 GB of free RAM and 1 TB of free disk space on my machine.


I'm running Ubuntu 18.04.5 LTS and Python 3.6.9. The ScaNN module was installed with pip.

Any idea what could be going on?

Thanks for the help

[edit] After @MSalters' comment, I did some testing and found that if the dataset to be serialized is larger than 16777220 bytes (2^24 + 4), it fails with the bad_alloc message. I still don't know why this happens...

[edit2] I built ScaNN from source and put some debug prints in it. The error seems to come from this line:

vector<uint8_t> storage(hash_dim * expected_size);

and if I print the values like this:

std::cout << hash_dim <<  " " << expected_size <<"\n" << std::flush;
std::cout << hash_dim * expected_size <<"\n" << std::flush;
vector<uint8_t> v2;
std::cout << v2.max_size() << "\n" << std::flush;
vector<uint8_t> storage(hash_dim * expected_size);
std::cout << "after storage creation\n" << std::flush;

Then I get:

256 8388608
-2147483648
9223372036854775807

Upvotes: 3

Views: 353

Answers (1)

rustyx

Reputation: 85276

There seems to be an existing issue report in ScaNN, #427, with a similar error.

Based on the output of -2147483648 for std::cout << hash_dim * expected_size, we can conclude that hash_dim * expected_size overflows: 256 * 8388608 = 2147483648 = 2^31, which is one more than INT_MAX, so the 32-bit multiplication wraps around to -2147483648.

Looking at the source, we see that both hash_dim and expected_size are of type int.

So the type of at least one of them should probably have been int64_t, long long, or, better yet, size_t.
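
For illustration, here is a minimal standalone sketch (not ScaNN's actual code) of the suspected overflow and how widening an operand avoids it; the values mirror the ones printed in the question:

#include <cstddef>
#include <cstdint>
#include <iostream>

int main() {
  int hash_dim = 256;
  int expected_size = 8388608;  // 2^23, as printed in the question

  // int * int is done in 32 bits: 256 * 8388608 = 2^31, one past INT_MAX,
  // so the result overflows (UB; in practice it wraps to -2147483648).
  int overflowed = hash_dim * expected_size;
  std::cout << "int product:    " << overflowed << "\n";

  // When that negative int is passed to std::vector's size_t constructor,
  // it converts to an enormous value (~1.8e19), so the allocation fails
  // with std::bad_alloc even though plenty of RAM is free.
  std::cout << "as size_t:      " << static_cast<std::size_t>(overflowed) << "\n";

  // Widening one operand before multiplying keeps the correct 2147483648.
  std::size_t correct = static_cast<std::size_t>(hash_dim) * expected_size;
  std::cout << "size_t product: " << correct << "\n";
}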

Looking further through the ScaNN source, it seems there might be more places that could benefit from a data type specifically designed to hold a size (size_t) instead of an int.

Upvotes: 1
