Reputation: 11
I am using a large numpy.ndarray (11,000 x 3,180) to develop an active learning algorithm (text mining). In each iteration of this algorithm I have to delete 16 samples (row vectors) from my dataset and integrate them into the training set (which grows by 16 samples per iteration). After roughly 60 iterations, the algorithm is initialized again and the same process repeats from the beginning, for 100 runs.
To delete the set of 16 elements from my dataset, I use the method

numpy.delete(dataset[ListifoIndex], axis=0)

where ListifoIndex contains the indices of the selected items to remove.
This method works for the first run (1 of 100), but when the algorithm is initialized again I get the following error:
new = empty(newshape, arr.dtype, arr.flags.fnc)
MemoryError
Apparently the numpy.delete method creates a copy of my dataset for each of the indices (16 x 1.2 GB), which exceeds the amount of memory I have on my computer.
The question is: how can I remove items from a numpy.ndarray without using a lot of memory and without excessive execution time?
PS 1: I've tried the reverse approach, where I copy the elements whose indices are not in the removal list, but that process is very slow.
PS 2: Sometimes the error occurs before the algorithm is re-initialized (before iteration 60).
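For context, the overall loop looks roughly like this (simplified, with a placeholder for my real data and selection strategy):

import numpy as np

dataset = np.random.rand(11000, 3180)          # placeholder for my real data

for run in range(100):                         # 100 runs
    pool = dataset.copy()                      # fresh unlabeled pool each run
    train = np.empty((0, pool.shape[1]))
    for iteration in range(60):                # ~60 iterations per run
        ListifoIndex = np.random.choice(len(pool), 16, replace=False)  # placeholder selection
        train = np.vstack([train, pool[ListifoIndex]])
        pool = np.delete(pool, ListifoIndex, axis=0)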
Upvotes: 1
Views: 3731
Reputation: 21
I know this is old, but I ran into the same problem and wanted to share the fix here. You are sort of correct when you say that numpy.delete keeps a copy of the database, but it isn't numpy, it's Python itself.
Say you randomly choose a row from the database to be part of the training set. Instead of taking the row's data, Python keeps a reference to the row, which keeps the whole database alive for when you next want to use that row. So when you delete the row from the old database, you create a new database from which you can choose another row, and the old one is kept alive as well, because it is still referenced through the row in the training set. 100 iterations later you end up with 100 copies of the database, each having one less row than the last, but containing the same data.
The solution I found: instead of appending the row itself to the training set, make a copy with copy.deepcopy and append that copy. This way Python doesn't need to keep the old database around for reference purposes.
Bad -

database = [0, 1, 2, 3, 4, 5, 6]
Train = []
for i in range(len(database)):
    Train.append(database[i])   # appends a reference, not a copy
Good -

import copy

for i in range(len(database)):
    copy_of_thing = copy.deepcopy(database[i])  # independent copy, no back-reference
    Train.append(copy_of_thing)
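You can see the same link directly with numpy: a row taken from a 2-D array by basic indexing is a view whose .base attribute is the whole array, while a copy has no such back-reference. (For numpy rows, row.copy() is enough; copy.deepcopy only matters when the elements are nested Python objects.) A small illustration:

import numpy as np

database = np.random.rand(1000, 8)
row = database[0]              # basic indexing returns a view
print(row.base is database)    # True: the view keeps the whole array alive

row_copy = database[0].copy()  # independent copy of just that row
print(row_copy.base is None)   # True: no back-reference to the array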
Upvotes: 2
Reputation: 231625
It may help to understand exactly what np.delete
does. In your case
newset = np.delete(dataset, ListifoIndex, axis=0)  # corrected call
in essence it does:
keep = np.ones(dataset.shape[0], dtype=bool) # array of True matching 1st dim
keep[ListifoIndex] = False
newset = dataset[keep, :]
In other words, it constructs a boolean index of the rows it wants to keep.
If I run

dataset = np.delete(dataset, ListifoIndex, axis=0)

repeatedly in an interactive shell, there isn't any accumulation of intermediate arrays. While delete runs, there is temporarily the keep array and one new copy of dataset; but after the assignment, the old copy disappears.
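One way to check this for yourself (a sketch using the standard tracemalloc module; recent numpy versions report their allocations to it):

import tracemalloc
import numpy as np

tracemalloc.start()
dataset = np.random.rand(11000, 3180)

for _ in range(60):
    dataset = np.delete(dataset, np.arange(16), axis=0)

current, peak = tracemalloc.get_traced_memory()
print(current / 1e9, peak / 1e9)   # peak stays near two dataset-sized copies, not sixty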
Are you sure it's the delete
that's growing memory use, as opposed to growing the training set?
As for speed, you might improve that by maintaining a 'mask' of all 'deleted' rows, rather than actually deleting anything (a sketch follows below). But depending on how ListifoIndex overlaps with previous deletions, updating that mask might be more trouble than it's worth. It's also likely to be more error prone.
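A sketch of that mask idea (the selection step is a stand-in; the point is that "deleting" becomes flipping mask entries, and rows are only copied once, at the end):

import numpy as np

dataset = np.random.rand(11000, 3180)        # the full data, never modified
keep = np.ones(dataset.shape[0], dtype=bool)
train_rows = []                              # original row numbers moved to training

for _ in range(60):
    alive = np.flatnonzero(keep)             # original numbers of surviving rows
    ListifoIndex = np.random.choice(len(alive), 16, replace=False)  # stand-in selection
    chosen = alive[ListifoIndex]             # translate current indices to original ones
    train_rows.append(chosen)
    keep[chosen] = False                     # "delete" is just flipping the mask

train = dataset[np.concatenate(train_rows)]  # one real copy, built at the end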
Upvotes: 3
Reputation: 97331
If the order doesn't matter, you can swap the rows to be deleted to the end of the array:
import numpy as np

n = 1000
a = np.random.rand(n, 8)
a[:, 0] = np.arange(n)                 # tag each row so the swaps are easy to verify
del_index = np.array([10, 100, 200, 500, 800, 995, 997, 999])

# rows to delete that fall inside the kept region (the first n - len(del_index) rows)
del_index2 = del_index[del_index < len(a) - len(del_index)]
# tail positions, minus the ones that are themselves scheduled for deletion
copy_index = np.arange(len(a) - len(del_index), len(a))
copy_index2 = np.setdiff1d(copy_index, del_index)
# swap: surviving tail rows move into the slots of the deleted rows
a[copy_index2], a[del_index2] = a[del_index2], a[copy_index2]
and then you can use a slice to create a new view:
a2 = a[:-len(del_index)]
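Since a2 is just a slice of a, this step allocates no new memory; call a2.copy() only if you later need the result to be independent of a.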
If you want to keep the order, you can use a for loop and slice copies:
import numpy as np

n = 1000
a = np.random.rand(n, 8)
a[:, 0] = np.arange(n)
del_index = np.array([100, 10, 200, 500, 800, 995, 997, 999])
a2 = np.delete(a, del_index, axis=0)   # reference result to check against
del_index.sort()

# shift each block between consecutive deleted rows left by the number of
# deletions seen so far (this assumes the last deleted index is the last row)
for i, (start, end) in enumerate(zip(del_index[:-1], del_index[1:])):
    a[start - i:end - 1 - i] = a[start + 1:end]

print(np.all(a[:-8] == a2))            # True
Upvotes: 0