DanielKwon

Reputation: 21

How to speed up pandas dataframe.apply() for large data

def func(row):
    # 2: exact or reversed match, 1: one shared character, 0: no overlap
    if row.GT_x == row.GT_y or row.GT_x == row.GT_y[::-1]:
        return 2
    elif len(set(row.GT_x) & set(row.GT_y)) != 0:
        return 1
    else:
        return 0

%%timeit
merged_df['Decision'] = merged_df.apply(func, axis=1)

1 loop, best of 3: 30.2 s per loop

I'm applying func to every row of the dataframe, and it has approximately 650,000 rows.

I suspect pandas.apply() takes even more time than iterating with a plain for loop.

I also tried a lambda function instead of func, but the result was the same.

My dataframe has two columns, GT_x and GT_y, each holding two-character strings such as "AA" or "BB". func returns 2 if GT_x and GT_y are the same (directly or reversed), 1 if they share one character, and 0 otherwise.

I want to create a new column (Decision) by applying func.

Could you recommend another faster method?

Edit: here's the sample data I have:

       GT_x  GT_y
0      AG    GA
1      AA    GA
2      AA    GG
3      GG    GG
...
65000  GG    GG

The result for index 0 should be 2, index 1 should be 1, and index 2 should be 0; indices 3 and 65000 should both be 2.
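
To make this concrete, here's a minimal repro of the sample rows above, using func as defined at the top of the question (the 65000th row is omitted):

import pandas as pd

merged_df = pd.DataFrame({
    'GT_x': ['AG', 'AA', 'AA', 'GG'],
    'GT_y': ['GA', 'GA', 'GG', 'GG'],
})

merged_df['Decision'] = merged_df.apply(func, axis=1)
print(merged_df['Decision'].tolist())  # prints [2, 1, 0, 2]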

Upvotes: 2

Views: 2715

Answers (1)

Alex Ozerov

Reputation: 1028

You can use df.apply(func, axis=1, raw=True) for faster computation; in that case the input to your function will be a raw NumPy array instead of a Series.

From the apply function's documentation:

raw : boolean, default False
    If False, convert each row or column into a Series. If raw=True the
    passed function will receive ndarray objects instead. If you are just
    applying a NumPy reduction function this will achieve much better
    performance.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html
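
Note that with raw=True your function receives a plain ndarray, so attribute access like row.GT_x no longer works; you index by position instead. A minimal sketch, assuming GT_x and GT_y are the first two columns of merged_df (func_raw is just a hypothetical rename):

def func_raw(row):
    # row is a numpy array here, e.g. array(['AG', 'GA'], dtype=object)
    gt_x, gt_y = row[0], row[1]
    if gt_x == gt_y or gt_x == gt_y[::-1]:
        return 2
    if set(gt_x) & set(gt_y):
        return 1
    return 0

merged_df['Decision'] = merged_df.apply(func_raw, axis=1, raw=True)

Since these columns hold strings rather than numbers, the gain comes mostly from skipping the per-row Series construction, so the speedup may be more modest than for a NumPy reduction.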

Upvotes: 1
