Reputation: 851
I am entering large values (greater than 2^70) into numpy arrays using dtype=object:
numpy.array([1], dtype=numpy.object) << 70
array([1180591620717411303424], dtype=object)
The only reason I am using dtype=numpy.object here is that the fixed-width integer dtypes overflow when you try to store values this large in them.
numpy.array([1]) << 70
array([64], dtype=int32)  # the result should have been array([1180591620717411303424], dtype=object)
The details are explained in my other question here. In such cases, using dtype=object works fine.
But I found that using dtype=numpy.object is very slow.
To verify, I compared the timings of the following operations:
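A minimal sketch of such a comparison (the array size and shift amount here are arbitrary, chosen only for illustration):

import timeit

setup = "import numpy"

# left shift on a regular integer array
t_int = timeit.timeit("numpy.arange(10**6) << 2", setup=setup, number=10)

# the same shift on an object array holding Python ints
t_obj = timeit.timeit("numpy.arange(10**6).astype(object) << 2", setup=setup, number=10)

print("int dtype:   ", t_int)
print("object dtype:", t_obj)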
You can see that dtype=int is much faster.
So is there any workaround for storing large values in numpy arrays without this slowdown?
Upvotes: 0
Views: 1509
Reputation: 11860
Based on the user's comment:
I am building a bitmap index of persons vs videos, where I have a lot of videos (more than 1000) for a limited set of people (about 100). Each cell represents whether that person appears in the video (1) or not (0). This table is stored in a file. When I want to know whether two people appear in a video together, I read the corresponding rows for those two people and do a bitwise 'AND' to get the result, then locate the indices where there is a '1'. To do this bitwise 'AND', I need to convert the binary lists into integers first and then compute the result.
A much simpler solution would be to load the binary array (dummy data used here, shape = persons x videos), then compare the two rows corresponding to your two persons (say person 10 and person 37) using &, and finally retrieve the video indices where both occur:
import numpy

# dummy boolean map: one row per person, one column per video
my_map = numpy.random.randint(0, 2, (100, 1000), dtype=bool)

# indices of videos in which both person 10 and person 37 appear
appear_together_in_video_index = numpy.where(my_map[10] & my_map[37])
This way you are only ever dealing with booleans (8 bits each) and avoid the large-number issue completely.
To answer the original question: it is not really a fair comparison. Since your aim is to work with large numbers, you should time two solutions that actually support such large numbers. The basic alternative is to hold Python ints in a plain Python list, which is no faster (nor simpler to handle) than storing them as objects in numpy arrays.
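As a rough sketch of that fairer comparison (the sizes and the shift amount below are arbitrary), both variants end up doing one Python-level big-integer shift per element:

import timeit

setup = """
import numpy
py_list = list(range(10**5))
obj_arr = numpy.arange(10**5).astype(object)
"""

# plain Python list of ints, shifted in a list comprehension
t_list = timeit.timeit("[x << 70 for x in py_list]", setup=setup, number=10)

# numpy object array: the same element-wise shift on Python ints
t_obj = timeit.timeit("obj_arr << 70", setup=setup, number=10)

print("Python list:        ", t_list)
print("numpy object array: ", t_obj)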
Upvotes: 2