Abhishek Bhatia

Reputation: 9806

Conversion to numpy array crashing RAM

I have a list of numpy arrays. The list has 200001 elements and each array has 3504 elements. The list itself fits in my RAM without any problem:

(Pdb) type(x)
<type 'list'>
(Pdb) len(x)
200001
(Pdb) type(x[1])
<type 'numpy.ndarray'>
(Pdb) x[1].shape
(3504L,)

The problem is that when I convert the list to a numpy array, RAM usage hits 100% and my PC freezes/crashes. I want to do the conversion so that I can perform feature scaling and PCA.

EDIT: I want to convert each sample into a concatenated array of the previous 1000 samples plus the sample itself.

import pywt

def take_previous_data(X_train, y):
    temp_train_data = X_train[1000:]
    temp_labels = y[1000:]
    final_train_set = []
    for index, row in enumerate(temp_train_data):
        actual_index = index + 1000
        # flatten the current sample together with the previous 1000 samples
        data = X_train[actual_index - 1000:actual_index + 1].ravel()
        # keep only the detail coefficients of the Haar wavelet transform
        __, cd_i = pywt.dwt(data, 'haar')
        final_train_set.append(cd_i)
    # return the labels that line up with the transformed samples
    return final_train_set, temp_labels


x, y = take_previous_data(X_train, y)

Upvotes: 2

Views: 2986

Answers (1)

ali_m

Reputation: 74172

You could try rewriting take_previous_data as a generator function that lazily yields rows of your final array, then use np.fromiter, as Eli suggested:

from itertools import chain

import numpy as np
import pywt

def take_previous_data(X_train, y):
    temp_train_data = X_train[1000:]
    temp_labels = y[1000:]
    for index, row in enumerate(temp_train_data):
        actual_index = index + 1000
        data = X_train[actual_index - 1000:actual_index + 1].ravel()
        __, cd_i = pywt.dwt(data, 'haar')
        yield cd_i

gen = take_previous_data(X_train, y)

# I'm assuming that by "int" you meant "int64"
x = np.fromiter(chain.from_iterable(gen), np.int64)

# fromiter gives a 1D output, so we reshape it into a (200001, 3504) array
x.shape = 200001, -1
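If the total number of elements is known up front, np.fromiter also takes a count argument, which lets it allocate the whole buffer in one go rather than growing it as the iterator is consumed. A minimal sketch, reusing the imports above and assuming the output really is 200001 rows of 3504 values:

gen = take_previous_data(X_train, y)  # fresh generator; the one above is exhausted
# count pre-allocates the full output buffer for fromiter
x = np.fromiter(chain.from_iterable(gen), np.int64, count=200001 * 3504)
x.shape = 200001, -1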

Another option would be to pre-allocate the output array and fill in the rows as you go along:

def take_previous_data(X_train, y):
    temp_train_data = X_train[1000:]
    temp_labels = y[1000:]
    # pre-allocate the full output array, then fill it row by row
    out = np.empty((200001, 3504), np.int64)
    for index, row in enumerate(temp_train_data):
        actual_index = index + 1000
        data = X_train[actual_index - 1000:actual_index + 1].ravel()
        __, cd_i = pywt.dwt(data, 'haar')
        out[index] = cd_i
    return out
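If hardcoding the (200001, 3504) shape feels fragile, the row count can be derived from the input instead; a small sketch, assuming each dwt call really does return 3504 detail coefficients:

# one output row per sample after the first 1000
n_rows = len(X_train) - 1000
out = np.empty((n_rows, 3504), np.int64)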

From our chat conversation, it seems that the fundamental issue is that you can't actually fit the output array itself in memory. In that case, you could adapt the second solution to use np.memmap to write the output array to disk:

def take_previous_data(X_train, y):
    temp_train_data = X_train[1000:]
    temp_labels = y[1000:]
    # back the output array with a file on disk rather than RAM
    out = np.memmap('my_array.mmap', dtype=np.int64, mode='w+',
                    shape=(200001, 3504))
    for index, row in enumerate(temp_train_data):
        actual_index = index + 1000
        data = X_train[actual_index - 1000:actual_index + 1].ravel()
        __, cd_i = pywt.dwt(data, 'haar')
        out[index] = cd_i
    return out
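Once the loop has run, the data lives on disk, so it can be flushed and mapped again later without loading everything into RAM; a quick sketch, assuming the same shape and dtype as above:

out = take_previous_data(X_train, y)
out.flush()  # make sure pending writes hit the file
# reopen read-only in a later session
x = np.memmap('my_array.mmap', dtype=np.int64, mode='r', shape=(200001, 3504))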

One other obvious solution would be to reduce the bit depth of your array. I've assumed that by int you meant int64 (the default integer type in numpy). If you could switch to a lower bit depth (e.g. int32, int16 or maybe even int8), you could drastically reduce your memory requirements.
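To put rough numbers on that, here is a back-of-the-envelope estimate based on the shape above:

# approximate memory footprint of the full (200001, 3504) array
n_elements = 200001 * 3504            # ~7.0e8 values
print(n_elements * 8 / 2 ** 30)       # int64: ~5.2 GiB
print(n_elements * 2 / 2 ** 30)       # int16: ~1.3 GiB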

Upvotes: 2
