selecting different columns each row

Question

I have a dataframe which has 500K rows and 7 columns for days and include start and end day.

I search a value(like equal 0) in range(startDay, endDay)

Such as, for id_1, startDay=1, and endDay=7, so, I should seek a value D1 to D7 columns.

For id_2, startDay=4, and endDay=7, so, I should seek a value D4 to D7 columns. However, I couldn't seek different column range successfully.

Above-mentioned,

if startDay > endDay, I should see "-999"
else, I need to find first zero (consider the day range) and such as for id_3's, first zero in D2 column(day 2). And starDay of id_3 is 1. And I want to see, 2-1=1 (D2 - StartDay)
if I cannot find 0, I want to see "8"

Here is my data;

data = {
        'D1':[0,1,1,0,1,1,0,0,0,1],
        'D2':[2,0,0,1,2,2,1,2,0,4],
        'D3':[0,0,1,0,1,1,1,0,1,0],
        'D4':[3,3,3,1,3,2,3,0,3,3],
        'D5':[0,0,3,3,4,0,4,2,3,1],
        'D6':[2,1,1,0,3,2,1,2,2,1],
        'D7':[2,3,0,0,3,1,3,2,1,3],
        'startDay':[1,4,1,1,3,3,2,2,5,2],
        'endDay':[7,7,6,7,7,7,2,1,7,6]
        }
data_idx = ['id_1','id_2','id_3','id_4','id_5',
            'id_6','id_7','id_8','id_9','id_10']
df = pd.DataFrame(data, index=data_idx)

What I want to see;

df_need = pd.DataFrame([0,1,1,0,8,2,8,-999,8,1], index=data_idx)

Ben.T · Accepted Answer

You can create boolean array to check in each row which 'Dx' column(s) are above 'startDay' and below 'endDay' and the value is equal to 0. For the first two conditions, you can use np.ufunc.outer with the ufunc being np.less_equal and np.greater_equal such as: import numpy as np

arr_bool = ( np.less_equal.outer(df.startDay, range(1,8)) # which columns Dx is above startDay
            & np.greater_equal.outer(df.endDay, range(1,8)) # which columns Dx is under endDay
            & (df.filter(regex='D[0-9]').values == 0)) #which value of the columns Dx are 0

Then you can use np.argmax to find the first True per row. By adding 1 and removing 'startDay', you get the values you are looking for. Then you need to look for the other conditions with np.select to replace values by -999 if df.startDay >= df.endDay or 8 if no True in the row of arr_bool such as:

df_need = pd.DataFrame( (np.argmax(arr_bool , axis=1) + 1  - df.startDay).values, 
                        index=data_idx, columns=['need'])
df_need.need= np.select( condlist = [df.startDay >= df.endDay, ~arr_bool.any(axis=1)],
                         choicelist = [ -999, 8], 
                         default = df_need.need)

print (df_need)
         need
id_1        0
id_2        1
id_3        1
id_4        0
id_5        8
id_6        2
id_7     -999
id_8     -999
id_9        8
id_10       1

One note: to get -999 for id_7, I used the condition df.startDay >= df.endDay in np.select and not df.startDay > df.endDay like in your question, but you can cahnge to strict comparison, you get 8 instead of -999 in this case.

selecting different columns each row

Answers (1)

Related Questions