How to conditionally use `pandas.DataFrame.apply` based on values in a certain column?

Question

I have a pandas DataFrame whereby I'm using pandas.DataFrame.apply().

dict1 = {'file': ['filename2', 'filename2', 'filename3', 'filename4', 
         'filename4', 'filename3'], 'amount': [3, 4, 5, 1, 2, 1], 
         'front': [21889611, 36357723, 196312, 11, 42, 1992], 
         'back':[21973805, 36403870, 277500, 19, 120, 3210], 
         'type':['A', 'A', 'A', 'B', 'B', 'C']}

df1 = pd.DataFrame(dict1)
print(df1)

        file  amount     front      back type
0  filename2       3  21889611  21973805    A
1  filename2       4  36357723  36403870    A
2  filename3       5    196312    277500    A
3  filename4       1        11        19    B
4  filename4       2        42       120    B
5  filename3       1      1992      3210    C

My task is to take N random draws between columns front and back, whereby N is equal to the value in column amount:

def my_func(x):
    return np.random.choice(np.arange(x.front, x.back+1), x.amount).tolist()

I would only like to apply this function on rows whereby type is equal to A.

If I didn't have the conditional, I would use pd.DataFrame.apply() as follows:

df1["new_column"] = df1.apply(my_func, axis=1)

To apply this only if type==A, I could do something like

if set(['A']).issubset(df1['type']):
    df1["new_column"] = df1.apply(my_func, axis=1)

However, this doesn't actually work---my_func is applied to all functions.

(1) Why doesn't the above work as expected?

(2) How would one efficiently only apply my_func to the rows which satisfy type==A? In the example above, this would be for the first three rows, and the next three rows would be NA

Scott Boston · Accepted Answer

Filter your dataframe first then apply my_func. Let's use query:

df1['new_column'] = df1.query('type == "A"').apply(my_func, axis=1)

Output:

   amount      back       file     front type  \
0       3  21973805  filename2  21889611    A   
1       4  36403870  filename2  36357723    A   
2       5    277500  filename3    196312    A   
3       1        19  filename4        11    B   
4       2       120  filename4        42    B   
5       1      3210  filename3      1992    C   

                                 new_column  
0            [21921030, 21908574, 21971743]  
1  [36391053, 36371413, 36394390, 36376405]  
2  [198648, 263355, 197017, 261666, 260815]  
3                                       NaN  
4                                       NaN  
5                                       NaN

How to conditionally use `pandas.DataFrame.apply` based on values in a certain column?

Answers (1)

Related Questions