Reputation: 11
I am using a large numpy.ndarray (11,000 x 3,180) to develop an active learning algorithm (text mining). In each iteration of this algorithm I have to delete 16 samples (row vectors) from my dataset and integrate them into the training set (which grows by 16 samples per iteration). After roughly 60 iterations, the algorithm is initialized again and the same process repeats from the beginning, for 100 runs.
To delete the set of 16 elements from my dataset, I use the method

numpy.delete(dataset[ListifoIndex], axis=0)

where ListifoIndex contains the indices of the selected items to remove.
This method works for the first run (1 of 100), but when the algorithm is initialized again I get the following error:
new = empty(newshape, arr.dtype, arr.flags.fnc)
MemoryError
Apparently the numpy.delete method creates a copy of my dataset for each of the indices (16 x 1.2 GB), which exceeds the amount of memory I have on my computer.
The question is: how can I remove items from a numpy.ndarray without using a lot of memory and without excessive execution time?
PS 1: I've tried the reverse approach, where I copy the elements whose indices are not in the removal list, but that process is very slow.
PS 2: Sometimes the error occurs before the algorithm is re-initialized (before iteration 60).
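For context, the overall loop looks roughly like this (simplified, with a placeholder for my real data and selection strategy):

import numpy as np

dataset = np.random.rand(11000, 3180)          # placeholder for my real data

for run in range(100):                         # 100 runs
    pool = dataset.copy()                      # fresh unlabeled pool each run
    train = np.empty((0, pool.shape[1]))
    for iteration in range(60):                # ~60 iterations per run
        ListifoIndex = np.random.choice(len(pool), 16, replace=False)  # placeholder selection
        train = np.vstack([train, pool[ListifoIndex]])
        pool = np.delete(pool, ListifoIndex, axis=0)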
Upvotes: 1
Views: 3731
Reputation: 21
I know this is old, but I ran into the same problem and wanted to share the fix here. You are sort of correct when you say that numpy.delete keeps a copy of the database, but it isn't numpy, it's Python itself.
Say you randomly choose a row from the database to be part of the training set. Instead of taking the row's data, Python keeps a reference to the row, which keeps the whole database alive for when you next want to use that row. So when you delete the row from the old database, you create a new database from which you can choose another row, and the old one is kept alive as well, because it is still referenced through the row in the training set. 100 iterations later you end up with 100 copies of the database, each having one less row than the last, but containing the same data.
The solution I found: instead of appending the row itself to the training set, make a copy with copy.deepcopy and append that copy. This way Python doesn't need to keep the old database around for reference purposes.
Bad -

database = [0, 1, 2, 3, 4, 5, 6]
Train = []
for i in range(len(database)):
    Train.append(database[i])   # appends a reference, not a copy
Good -

import copy

for i in range(len(database)):
    copy_of_thing = copy.deepcopy(database[i])  # independent copy, no back-reference
    Train.append(copy_of_thing)
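You can see the same link directly with numpy: a row taken from a 2-D array by basic indexing is a view whose .base attribute is the whole array, while a copy has no such back-reference. (For numpy rows, row.copy() is enough; copy.deepcopy only matters when the elements are nested Python objects.) A small illustration:

import numpy as np

database = np.random.rand(1000, 8)
row = database[0]              # basic indexing returns a view
print(row.base is database)    # True: the view keeps the whole array alive

row_copy = database[0].copy()  # independent copy of just that row
print(row_copy.base is None)   # True: no back-reference to the array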
Upvotes: 2
Reputation: 231625
It may help to understand exactly what np.delete
does. In your case
newset = np.delete(dataset, ListifoIndex, axis=0)  # corrected call
in essence it does:
keep = np.ones(dataset.shape[0], dtype=bool) # array of True matching 1st dim
keep[ListifoIndex] = False
newset = dataset[keep, :]
In other words, it constructs a boolean index of the rows it wants to keep.
If I run

dataset = np.delete(dataset, ListifoIndex, axis=0)

repeatedly in an interactive shell, there isn't any accumulation of intermediate arrays. While delete runs, there is temporarily the keep array and one new copy of dataset; but after the assignment, the old copy disappears.
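One way to check this for yourself (a sketch using the standard tracemalloc module; recent numpy versions report their allocations to it):

import tracemalloc
import numpy as np

tracemalloc.start()
dataset = np.random.rand(11000, 3180)

for _ in range(60):
    dataset = np.delete(dataset, np.arange(16), axis=0)

current, peak = tracemalloc.get_traced_memory()
print(current / 1e9, peak / 1e9)   # peak stays near two dataset-sized copies, not sixty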
Are you sure it's the delete
that's growing memory use, as opposed to growing the training set?
As for speed, you might improve that by maintaining a 'mask' of all 'deleted' rows, rather than actually deleting anything (a sketch follows below). But depending on how ListifoIndex overlaps with previous deletions, updating that mask might be more trouble than it's worth. It's also likely to be more error prone.
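A sketch of that mask idea (the selection step is a stand-in; the point is that "deleting" becomes flipping mask entries, and rows are only copied once, at the end):

import numpy as np

dataset = np.random.rand(11000, 3180)        # the full data, never modified
keep = np.ones(dataset.shape[0], dtype=bool)
train_rows = []                              # original row numbers moved to training

for _ in range(60):
    alive = np.flatnonzero(keep)             # original numbers of surviving rows
    ListifoIndex = np.random.choice(len(alive), 16, replace=False)  # stand-in selection
    chosen = alive[ListifoIndex]             # translate current indices to original ones
    train_rows.append(chosen)
    keep[chosen] = False                     # "delete" is just flipping the mask

train = dataset[np.concatenate(train_rows)]  # one real copy, built at the end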
Upvotes: 3
Reputation: 97331
If the order doesn't matter, you can swap the rows to be deleted to the end of the array:
import numpy as np

n = 1000
a = np.random.rand(n, 8)
a[:, 0] = np.arange(n)                 # tag each row so the swaps are easy to verify
del_index = np.array([10, 100, 200, 500, 800, 995, 997, 999])

# rows to delete that fall inside the kept region (the first n - len(del_index) rows)
del_index2 = del_index[del_index < len(a) - len(del_index)]
# tail positions, minus the ones that are themselves scheduled for deletion
copy_index = np.arange(len(a) - len(del_index), len(a))
copy_index2 = np.setdiff1d(copy_index, del_index)
# swap: surviving tail rows move into the slots of the deleted rows
a[copy_index2], a[del_index2] = a[del_index2], a[copy_index2]
and then you can use a slice to create a new view:
a2 = a[:-len(del_index)]
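Since a2 is just a slice of a, this step allocates no new memory; call a2.copy() only if you later need the result to be independent of a.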
If you want to keep the order, you can use a for loop and slice copies:
import numpy as np

n = 1000
a = np.random.rand(n, 8)
a[:, 0] = np.arange(n)
del_index = np.array([100, 10, 200, 500, 800, 995, 997, 999])
a2 = np.delete(a, del_index, axis=0)   # reference result to check against
del_index.sort()

# shift each block between consecutive deleted rows left by the number of
# deletions seen so far (this assumes the last deleted index is the last row)
for i, (start, end) in enumerate(zip(del_index[:-1], del_index[1:])):
    a[start - i:end - 1 - i] = a[start + 1:end]

print(np.all(a[:-8] == a2))            # True
Upvotes: 0