Randomly select rows from numpy array based on a condition

Question

Let's say I have 2 arrays of arrays, labels is 1D and data is 5D note that both arrays have the same first dimension.

To simplify things let's say labels contain only 3 arrays :

labels=np.array([[0,0,0,1,1,2,0,0],[0,4,0,0,0],[0,3,0,2,1,0,0,1,7,0]])

And let's say I have a datalist of data arrays (length=3) where each array has a 5D shape where the first dimension of each one is the same as the arrays of the labels array.

In this example, datalist has 3 arrays of shapes : (8,3,100,10,1), (5,3,100,10,1) and (10,3,100,10,1) respectively. Here, the first dimension of each of these arrays is the same as the lengths of each array in label.

Now I want to reduce the number of zeros in each array of labels and keep the other values. Let's say I want to keep only 3 zeros for each array. Therefore, the length of each array in labels as well as the first dimension of each array in data will be 6, 4 and 8.

In order to reduce the number of zeros in each array of labels, I want to randomly select and keep only 3. Now these same random selected indexes will be used then to select the correspondant rows from data.

For this example, the new_labels array will be something like this :

new_labels=np.array([[0,0,1,1,2,0],[4,0,0,0],[0,3,2,1,0,1,7,0]])

Here's what I have tried so far :

all_ind=[] #to store indexes where value=0 for all arrays
indexes_to_keep=[] #to store the random selected indexes
new_labels=[] #to store the final results

for i in range(len(labels)):
    ind=[] #to store indexes where value=0 for one array
    for j in range(len(labels[i])):
        if (labels[i][j]==0):
            ind.append(j)
    all_ind.append(ind)

for k in range(len(labels)):   
    indexes_to_keep.append(np.random.choice(all_ind[i], 3))
    aux= np.zeros(len(labels[i]) - len(all_ind[i]) + 3)
    ....
    .... 
    Here, how can I fill **aux** with the values ?
    ....
    .... 
    new_labels.append(aux)

Any suggestions ?

mathfux · Accepted Answer

Playing with numpy arrays of different lenghts is not a good idea therefore you are required to iterate each item and perform some method on it. Assuming you want to optimize that method only, masking might work pretty well here:

def specific_choice(x, n):
    '''leaving n random zeros of the list x'''
    x = np.array(x)
    mask = x != 0
    idx = np.flatnonzero(~mask)
    np.random.shuffle(idx) #dynamical change of idx value, quite fast
    idx = idx[:n]
    mask[idx] = True
    return x[mask] # or mask if you need it

Iteration of list is faster than one of array so effective usage would be:

labels = [[0,0,0,1,1,2,0,0],[0,4,0,0,0],[0,3,0,2,1,0,0,1,7,0]]
output = [specific_choice(n, 3) for n in labels]

Output:

[array([0, 1, 1, 2, 0, 0]), array([0, 4, 0, 0]), array([0, 3, 0, 2, 1, 1, 7, 0])]

Randomly select rows from numpy array based on a condition

Answers (1)

Related Questions