Samer Aamar

Reputation: 1408

CSharp: Failed to read LARGE .npy file. Exception is "NumSharp.dll Arithmetic operation resulted in an overflow."

I am trying to read a large .npy file in C#. To do that, I am using the NumSharp NuGet package.

The file is a 7 GB jagged float array (float[][]): ~1 million vectors, each with 960 dimensions.

Note: to be more specific, the data I use is the GIST dataset from the following link: Approximate Nearest Neighbors Large datasets.

The following is the method I use to load the data, but it fails with an exception:

    private static void ReadNpyVectorsFromFile(string pathPrefix, out List<float[]> candidates)
    {
        var npyFilename = @$"{pathPrefix}.npy";
        
        var v = np.load(npyFilename);//NDArray
        
        candidates = v
            .astype(np.float32)
            .ToJaggedArray<float>()
            .OfType<float[]>()
            .Select(a => a.OfType<float>().ToArray())
            .ToList();
    }

The exception is:

    Exception thrown: 'System.OverflowException' in NumSharp.dll
    An unhandled exception of type 'System.OverflowException' occurred in NumSharp.dll
    Arithmetic operation resulted in an overflow.

How can I workaround this?


Update

The NumSharp package has a limitation when the file is too big. Read the comments/answers below for more explanation. I added one answer with a suggestion for a workaround.

However, a good alternative is to save the data as .npz (refer to numpy.savez()); then the following package can do the job:

https://github.com/matajoh/libnpy
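
For reference, here is a minimal sketch of the Python side that produces such a .npz archive with numpy.savez (the output path and the key name vectors are placeholders, not taken from the original dataset):

    import numpy as np

    # load the original large array
    data = np.load(r'C:\temp\input\GIST.1m.npy')

    # numpy.savez stores the array under the given keyword name; on the C# side
    # this is the key that npz.Keys() exposes (possibly with an ".npy" suffix)
    np.savez(r'C:\temp\input\GIST.1m.npz', vectors=data)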

C# code sample for reading the .npz with libnpy:

        // path to the .npz archive written by numpy.savez()
        var npzFilename = @$"{pathPrefix}.npz";

        NPZInputStream npz = new NPZInputStream(npzFilename);
        var keys = npz.Keys();
        //var header = npz.Peek(keys[0]);
        var t = npz.ReadFloat32(keys[0]);

        Debug.Assert(t.DataType == DataType.FLOAT32);

Upvotes: 1

Views: 557

Answers (2)

Samer Aamar

Reputation: 1408

The issue is that the NumSharp data structure is a heavy RAM consumer, and it seems the C# GC is not aware of what NumSharp is allocating, so it reaches the RAM limit very quickly.

To overcome this, I split the input .npy file so that each part does not consume more than the maximum array allocation allowed in C# (2147483591 bytes). In my case I split it into 5 files (200k vectors each).

The Python part to split the large .npy file:

    import numpy as np

    infile = r'C:\temp\input\GIST.1m.npy'
    data = np.load(infile)
    size = data.shape[0]

    # create 5 files
    incr = int(size / 5)

    # the +1 is to handle any leftovers
    r = range(0, int(size / incr) + 1)

    for i in r:
        print(i)

        start = i * incr
        stop = min(start + incr, size)

        if start >= size:
            break

        np.save(infile.replace('.npy', f'.{i}.npy'), data[start:stop])
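
To sanity-check the split (a quick sketch, assuming the parts were written next to the original file as GIST.1m.0.npy, GIST.1m.1.npy, ...), each part should stay well below the 2147483591-byte limit:

    import glob
    import os

    for f in sorted(glob.glob(r'C:\temp\input\GIST.1m.*.npy')):
        # every split file should be far below 2147483591 bytes
        print(f, os.path.getsize(f))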

Now, in C#, the code looks as follows:

    private static void ReadNpyVectorsFromFile(string pathPrefix, out List<float[]> candidates)
    {
        candidates = new List<float[]>();

        // TODO: 
        // For now I am assuming there are 10 files maximum... 
        // this can be improved by scanning the input folder and 
        // collecting all the relevant files.
        foreach (var i in Enumerable.Range(0, 10))
        {
            var npyFilename = @$"{pathPrefix}.{i}.npy";
            Console.WriteLine(npyFilename);

            if (!File.Exists(npyFilename))
                continue;

            var v = np.load(npyFilename); //NDArray

            var tempList = v
                .astype(np.float32)
                .ToJaggedArray<float>()
                .OfType<float[]>()
                .Select(a => a.OfType<float>().ToArray())
                .ToList();

            candidates.AddRange(tempList);
        }
    }

Upvotes: 1

duongntbk

Reputation: 670

I see that you've already found a workaround. Just in case you want to know the cause of your problem: it is a limitation of the Array class in .NET.

The np.load(string path) method is defined here, which in turn calls np.load(Stream stream).

int bytes;
Type type;
int[] shape;
if (!parseReader(reader, out bytes, out type, out shape))
    throw new FormatException();

Array array = Arrays.Create(type, shape.Aggregate((dims, dim) => dims * dim));

var result = new NDArray(readValueMatrix(reader, array, bytes, type, shape));
return result.reshape(shape);

Here, bytes is the size of your data type; because you are using float, this value is 4. And shape contains the number of vectors and their dimensions.

Next, let's look at the readValueMatrix method.

int total = 1;
for (int i = 0; i < shape.Length; i++)
    total *= shape[i];
var buffer = new byte[bytes * total];
// omitted

NumSharp is trying to create a one-dimensional byte array whose size equals bytes * total. Here, bytes is 4 and total is the number of vectors multiplied by the size of all dimensions.

However, in .NET, the maximum index in any given dimension of a byte array is 0x7FFFFFC7, which is 2147483591, as documented here. I haven't downloaded your data, but my guess is that it is big enough that bytes * total > 2147483591.
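
Plugging in the numbers from the question (roughly 1 million vectors of 960 float values, 4 bytes each; the exact row count is approximate) makes this easy to check:

    # rough check of the allocation NumSharp attempts for this dataset
    max_length = 0x7FFFFFC7        # 2147483591, .NET's maximum array length
    total = 1_000_000 * 960        # elements: vectors * dimensions
    required = total * 4           # bytes: elements * sizeof(float32)
    print(required)                # 3840000000
    print(required > max_length)   # True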

Note that if you want to use NumSharp to write your data back to a .npy file, you will hit the same problem inside the writeValueMatrix method.

Upvotes: 1
