Reputation: 3412
I am working with a big numpy matrix (approximately 75k rows of 2 integers each) from which I have to delete some rows. I would like to know if there is a fast way to delete a row without regenerating the whole array, i.e. is there a function that changes just the "mask" (or whatever it is called) of the matrix, without actually deleting the row in memory? I could then regenerate a clean matrix after I have marked all the rows to delete.
Upvotes: 1
Views: 2642
Reputation: 86330
Although masked arrays are a thing, I would probably do this with a separate boolean mask, e.g.
import numpy as np
big_array = np.random.rand(75000, 2)
rows_to_delete = np.random.randint(0, 75000, 500)
mask = np.ones(75000, dtype=bool)
mask[rows_to_delete] = False
output = big_array[mask]
print(output.shape)
# e.g. (74503, 2); the exact row count varies because randint may repeat indices
If you just have a list of indices to delete, the np.delete function is also an option:
output = np.delete(big_array, rows_to_delete, axis=0)
print(output.shape)
# (74503, 2)
Note that in either of these options, it is a new array that is returned, not a view of the original array.
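For completeness, here is a minimal sketch of the masked-array route mentioned at the top of this answer, which is closer to what the question asks for: rows are only flagged in a mask, and a clean array is built once at the end. The small array and the row indices are made up for illustration.

```python
import numpy as np
import numpy.ma as ma

# Small stand-in for the real 75k x 2 array (illustrative data only).
data = np.arange(20).reshape(10, 2)

# Wrap the data with an explicit all-False mask; the data itself is not copied.
masked = ma.masked_array(data, mask=np.zeros(data.shape, dtype=bool))

# "Delete" rows by masking them; only the mask changes, not the underlying data.
masked[[3, 7]] = ma.masked

# Build the clean array once, after all rows have been flagged.
clean = ma.compress_rows(masked)

print(clean.shape)
print(np.shares_memory(masked.data, data))
```

The final compress_rows call still produces a new array, so this postpones the copy rather than avoiding it.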
Upvotes: 5
Reputation: 231395
The fast way to select rows from an array is with a slice, which produces a view. But that requires a regular pattern such as every nth row. Any other selection produces a copy.
x[::10,:] # view
x[[1,3,6,10,20],:] # copy
x[[True,False,False,True,False,...],:] # copy
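A quick way to check the view/copy distinction is np.shares_memory. The sketch below uses a small made-up array x and a made-up boolean mask; only the slice shares memory with the original.

```python
import numpy as np

# Illustrative array; any 2-D array behaves the same way.
x = np.arange(200).reshape(100, 2)

sliced = x[::10, :]               # basic slicing
fancy = x[[1, 3, 6, 10, 20], :]   # integer-list indexing

keep = np.zeros(100, dtype=bool)  # hypothetical boolean mask
keep[[0, 3]] = True
boolean = x[keep, :]              # boolean-mask indexing

print(np.shares_memory(x, sliced))   # True: a view
print(np.shares_memory(x, fancy))    # False: a copy
print(np.shares_memory(x, boolean))  # False: a copy
```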
np.delete lets you specify which rows to remove, but one way or another it ends up making a copy that contains the remaining rows. It's a complex function that uses different methods depending on what you specify, but in many cases it constructs a mask as @jakevdp demonstrates.
So the fastest way to delete a bunch of rows is to delete them (or select their complement) all at once. Deleting one at a time is the slow way.
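To see the difference on an array of the size in the question, here is a rough timing sketch. The helper names, the random data, and the choice of 500 rows are assumptions for illustration; actual timings depend on the machine, so none are shown.

```python
import numpy as np
from timeit import timeit

rng = np.random.default_rng(0)
big = rng.integers(0, 100, size=(75000, 2))
rows = rng.choice(75000, size=500, replace=False)  # unique row indices

def delete_one_at_a_time(arr, rows):
    # Go from the highest index down so earlier deletions do not
    # shift the positions of rows still waiting to be removed.
    for r in sorted(rows, reverse=True):
        arr = np.delete(arr, r, axis=0)  # copies the whole array each time
    return arr

def delete_all_at_once(arr, rows):
    return np.delete(arr, rows, axis=0)  # a single copy

slow = delete_one_at_a_time(big, rows)
fast = delete_all_at_once(big, rows)
print(np.array_equal(slow, fast))

print("one at a time:", timeit(lambda: delete_one_at_a_time(big, rows), number=3))
print("all at once:  ", timeit(lambda: delete_all_at_once(big, rows), number=3))
```

Both functions return the same rows; the loop simply pays for a full copy of the array on every iteration.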
Upvotes: 4