Reputation: 1408
I am trying to read a large .npy file in C#. To do that, I am using the NumSharp NuGet package.
The file is a 7 GB jagged float array (float[][]). It has ~1 million vectors, each with 960 dimensions.
Note: To be more specific, the data I use is the GIST set from the following link: Approximate Nearest Neighbors Large datasets.
The following is the method I use to load the data, but it fails with an exception:
private static void ReadNpyVectorsFromFile(string pathPrefix, out List<float[]> candidates)
{
    var npyFilename = @$"{pathPrefix}.npy";
    var v = np.load(npyFilename); // NDArray
    candidates = v
        .astype(np.float32)
        .ToJaggedArray<float>()
        .OfType<float[]>()
        .Select(a =>
        {
            return a.OfType<float>().ToArray();
        })
        .ToList();
}
The exception is:
Exception thrown: 'System.OverflowException' in NumSharp.dll
An unhandled exception of type 'System.OverflowException' occurred in NumSharp.dll
Arithmetic operation resulted in an overflow.
How can I work around this?
The NumSharp package has a limitation when the file is too big. Read the comments/answers below for more explanation. I added one answer with a suggested workaround.
However, a good alternative is to save the data as .npz (refer to numpy.savez()); the following package can then do the job:
https://github.com/matajoh/libnpy
Code sample:
NPZInputStream npz = new NPZInputStream(npyFilename);
var keys = npz.Keys();
//var header = npz.Peek(keys[0]);
var t = npz.ReadFloat32(keys[0]);
Debug.Assert(t.DataType == DataType.FLOAT32);
Upvotes: 1
Views: 557
Reputation: 1408
The issue is that the NumSharp data structure is a heavy RAM consumer, and it seems that the C# GC is not aware of what NumSharp is allocating, so it reaches the RAM limit very fast.
To overcome this, I split the input .npy file so that no part consumes more than the maximum memory allocation allowed in C# (2147483591). In my case I split it into 5 different files (200k vectors each).
Python part to split the large .npy file:
import numpy as np

infile = r'C:\temp\input\GIST.1m.npy'
data = np.load(infile)
size = data.shape[0]  # number of vectors in the file
# create 5 files
incr = size // 5
# the +1 is to handle any leftovers
r = range(0, size // incr + 1)
for i in r:
    print(i)
    start = i * incr
    stop = min(start + incr, size)
    if start >= size:
        break
    # writes GIST.1m.0.npy, GIST.1m.1.npy, ...
    np.save(infile.replace('.npy', f'.{i}.npy'), data[start:stop])
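As a rough sanity check on the number of parts, the sketch below computes row ranges such that the raw data of each part stays under the 2147483591-byte limit mentioned above. The function name chunk_bounds and its parameters are hypothetical, not part of NumPy or NumSharp, and staying under this limit is only a lower bound on the number of parts: NumSharp allocates more than the raw data, which is presumably why 5 smaller files worked well here.

```python
import math

# Byte limit quoted in this thread for a single .NET byte array.
MAX_BYTES = 2_147_483_591


def chunk_bounds(n_rows, n_cols, itemsize, max_bytes=MAX_BYTES):
    """Return (start, stop) row ranges whose raw data fits in max_bytes.

    Hypothetical helper: n_rows and n_cols describe a 2-D array and
    itemsize is the size of one element in bytes (4 for float32).
    """
    row_bytes = n_cols * itemsize
    rows_per_part = max_bytes // row_bytes
    if rows_per_part == 0:
        raise ValueError("a single row already exceeds max_bytes")
    n_parts = math.ceil(n_rows / rows_per_part)
    return [
        (i * rows_per_part, min((i + 1) * rows_per_part, n_rows))
        for i in range(n_parts)
    ]


if __name__ == "__main__":
    # 1 million vectors of 960 float32 values, as described in the question.
    for start, stop in chunk_bounds(1_000_000, 960, 4):
        print(start, stop)
```

Each (start, stop) pair could then be used as data[start:stop] in the np.save call above.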
The C# code then looks as follows:
private static void ReadNpyVectorsFromFile(string pathPrefix, out List<float[]> candidates)
{
    candidates = new List<float[]>();
    // TODO:
    // For now I am assuming there are 10 files maximum...
    // this can be improved by scanning the input folder and
    // collecting all the relevant files.
    // The Python script numbers the parts from 0, so start at 0.
    foreach (var i in Enumerable.Range(0, 10))
    {
        var npyFilename = @$"{pathPrefix}.{i}.npy";
        Console.WriteLine(npyFilename);
        if (!File.Exists(npyFilename))
            continue;
        var v = np.load(npyFilename); // NDArray
        var tempList = v
            .astype(np.float32)
            .ToJaggedArray<float>()
            .OfType<float[]>()
            .Select(a => { return a.OfType<float>().ToArray(); })
            .ToList();
        candidates.AddRange(tempList);
    }
}
Upvotes: 1
Reputation: 670
I see that you've already found a workaround. Just in case you want to know the cause of your problem: it is a limitation of the Array class in .NET.
The np.load(string path) method is defined here, which in turn calls np.load(Stream stream).
int bytes;
Type type;
int[] shape;
if (!parseReader(reader, out bytes, out type, out shape))
    throw new FormatException();
Array array = Arrays.Create(type, shape.Aggregate((dims, dim) => dims * dim));
var result = new NDArray(readValueMatrix(reader, array, bytes, type, shape));
return result.reshape(shape);
Here, bytes is the size of your data type. Because you are using float, this value is 4. shape holds the number of vectors and the dimension of each vector.
Next, let's look at the readValueMatrix method.
int total = 1;
for (int i = 0; i < shape.Length; i++)
    total *= shape[i];
var buffer = new byte[bytes * total];
// omitted
NumSharp is trying to create a one-dimensional byte array whose size equals bytes * total. Here, bytes is 4 and total is the number of vectors multiplied by the size of each remaining dimension.
However, in .NET, the maximum index in any given dimension of a byte array is 0X7FFFFFC7, which is 2147483591, as documented here. I haven't downloaded your data yet, but my guess is that it is big enough that bytes * total > 2147483591.
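The arithmetic can be checked with a short Python sketch, using the shape given in the question (about 1 million vectors of 960 floats) and assuming 4 bytes per element. The helper names buffer_size and as_int32 are hypothetical. The as_int32 part rests on an assumption that is not verified against NumSharp: that the C# int multiplication wraps around silently (the default unchecked behaviour), in which case the requested array length would even be negative.

```python
# Maximum index of a .NET byte array, as quoted above.
BYTE_ARRAY_MAX = 0x7FFFFFC7  # 2147483591


def buffer_size(shape, itemsize):
    """Mirror the loop in readValueMatrix: itemsize * product of all dims."""
    total = 1
    for dim in shape:
        total *= dim
    return itemsize * total


def as_int32(value):
    """Wrap a Python int the way a 32-bit signed integer would wrap."""
    value &= 0xFFFFFFFF
    return value - 0x1_0000_0000 if value >= 0x8000_0000 else value


if __name__ == "__main__":
    needed = buffer_size((1_000_000, 960), 4)
    print(needed)                   # bytes requested for the buffer
    print(needed > BYTE_ARRAY_MAX)  # True means the request exceeds the limit
    print(as_int32(needed))         # value a wrapped 32-bit int would hold
```

With these numbers the buffer would need 3,840,000,000 bytes, well above the limit, and a single 200k-vector part (768,000,000 bytes) fits comfortably.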
Note that if you want to use NumSharp to write your data back to an .npy file, you will hit the same problem inside the writeValueMatrix method.
Upvotes: 1