user1315621

Reputation: 3412

Fast row deletion in numpy

I am working with a big numpy matrix (approximately 75k rows of 2 integers each) from which I have to delete some rows. Is there a fast way to delete a row without regenerating the whole array? That is, is there a function that changes just the "mask" (or whatever it is called) of the matrix, without actually deleting the row in memory? I could then regenerate a clean matrix once all the unwanted rows have been marked.

Upvotes: 1

Views: 2642

Answers (2)

jakevdp

Reputation: 86330

Although masked arrays are a thing, I would probably do this with a separate boolean mask, e.g.

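import numpy as np

big_array = np.random.rand(75000, 2)

# Indices of the rows to drop; randint can return the same index more than once
rows_to_delete = np.random.randint(0, 75000, 500)
mask = np.ones(75000, dtype=bool)   # True = keep the row
mask[rows_to_delete] = False

output = big_array[mask]
print(output.shape)
# e.g. (74503, 2); the exact count varies with how many indices are repeated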

If you just have a list of indices to delete, the np.delete function is also an option:

output = np.delete(big_array, rows_to_delete, axis=0)
print(output.shape)
# e.g. (74503, 2), the same shape the mask approach gives

Note that in either of these options, it is a new array that is returned, not a view of the original array.
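To check that claim, here is a minimal sketch that runs both approaches on the same indices and uses np.shares_memory to confirm that neither result is a view. The seeded generator and the variable names (via_mask, via_delete, and so on) are only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the run is repeatable
big_array = rng.random((75000, 2))
rows_to_delete = rng.integers(0, 75000, 500)

# Approach 1: boolean keep-mask
mask = np.ones(len(big_array), dtype=bool)
mask[rows_to_delete] = False
via_mask = big_array[mask]

# Approach 2: np.delete with the same indices
via_delete = np.delete(big_array, rows_to_delete, axis=0)

# Both give the same rows, and neither shares memory with the original
same_result = np.array_equal(via_mask, via_delete)
mask_shares = np.shares_memory(big_array, via_mask)
delete_shares = np.shares_memory(big_array, via_delete)
print(same_result, mask_shares, delete_shares)
# True False False
```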

Upvotes: 5

hpaulj

Reputation: 231395

The fast way to select rows from an array is with a slice, which produces a view. But that requires a regular pattern, such as every nth row. Any other selection (a list of indices or a boolean mask) produces a copy.

x[::10,:]   # view
x[[1,3,6,10,20],:]   # copy
x[[True,False,False,True,False,...],:]   # copy
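The view/copy distinction can be checked directly with np.shares_memory. This is a small sketch on a made-up 50x2 array; the array contents and the variable names are just for illustration.

```python
import numpy as np

x = np.arange(100).reshape(50, 2)

sliced = x[::10, :]               # regular stride
fancy = x[[1, 3, 6, 10, 20], :]   # integer index list

bool_mask = np.zeros(len(x), dtype=bool)
bool_mask[[0, 3]] = True
masked = x[bool_mask, :]          # boolean mask

slice_is_view = np.shares_memory(x, sliced)
fancy_is_view = np.shares_memory(x, fancy)
masked_is_view = np.shares_memory(x, masked)
print(slice_is_view, fancy_is_view, masked_is_view)
# True False False
```

Only the slice reuses the original buffer, so writing into sliced would also change x, while fancy and masked are independent arrays.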

np.delete lets you specify which rows to remove, but one way or another it ends up making a copy that contains the remaining rows. It's a complex function that uses different methods depending on what you specify. But in many cases it constructs a mask, as @jakevdp demonstrates.

So the fastest way to delete a bunch of rows is to delete them (or select their complement) all at once. Deleting one at a time is the slow way.
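For the workflow in the question (mark rows as deleted now, rebuild the array later), one way to sketch this is to keep a boolean keep-mask next to the data, flip entries as rows are discarded, and index once at the end. The helper mark_deleted and the small example array are hypothetical, not part of numpy.

```python
import numpy as np

data = np.arange(20).reshape(10, 2)
keep = np.ones(len(data), dtype=bool)   # True = row is still live


def mark_deleted(keep_mask, rows):
    """Flag rows as deleted in place; no data is moved or copied."""
    keep_mask[rows] = False


mark_deleted(keep, 3)
mark_deleted(keep, [7, 8])
mark_deleted(keep, 3)   # marking the same row twice is harmless

# One copy at the end, instead of one copy per deletion
compacted = data[keep]
print(compacted.shape)
# (7, 2)
```

Each call to mark_deleted only touches the mask, so the cost of copying the remaining rows is paid once, when compacted is built.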

Upvotes: 4
