DanielKwon

Reputation: 21

How to speed up pandas dataframe.apply() for large data

def func(row):
    # 2: exact or reversed match, 1: one shared character, 0: no overlap
    if row.GT_x == row.GT_y or row.GT_x == row.GT_y[::-1]:
        return 2
    elif len(set(row.GT_x) & set(row.GT_y)) != 0:
        return 1
    else:
        return 0

%%timeit
merged_df['Decision'] = merged_df.apply(func, axis=1)

1 loop, best of 3: 30.2 s per loop

I'm applying func to every row of the dataframe, and it has approximately 650,000 rows.

I suspect pandas.apply() takes even more time than iterating with a plain for loop.

I also tried a lambda function instead of func, but the result was the same.

My dataframe has two columns, GT_x and GT_y, each holding two-character strings such as "AA" or "BB". func returns 2 if GT_x and GT_y are the same (directly or reversed), 1 if they share one character, and 0 otherwise.

I want to create a new column (Decision) by applying func.

Could you recommend another faster method?

Edit: here's the sample data I have:

       GT_x  GT_y
0      AG    GA
1      AA    GA
2      AA    GG
3      GG    GG
...
65000  GG    GG

The result for index 0 should be 2, index 1 should be 1, and index 2 should be 0; indices 3 and 65000 should both be 2.
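
To make this concrete, here's a minimal repro of the sample rows above, using func as defined at the top of the question (the 65000th row is omitted):

import pandas as pd

merged_df = pd.DataFrame({
    'GT_x': ['AG', 'AA', 'AA', 'GG'],
    'GT_y': ['GA', 'GA', 'GG', 'GG'],
})

merged_df['Decision'] = merged_df.apply(func, axis=1)
print(merged_df['Decision'].tolist())  # prints [2, 1, 0, 2]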

Upvotes: 2

Views: 2715

Answers (1)

Alex Ozerov

Reputation: 1028

You can use df.apply(func, axis=1, raw=True) for faster computation; in that case the input to your function will be a raw NumPy array instead of a Series.

From the apply function's documentation:

raw : boolean, default False
    If False, convert each row or column into a Series. If raw=True the
    passed function will receive ndarray objects instead. If you are just
    applying a NumPy reduction function this will achieve much better
    performance.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html
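
Note that with raw=True your function receives a plain ndarray, so attribute access like row.GT_x no longer works; you index by position instead. A minimal sketch, assuming GT_x and GT_y are the first two columns of merged_df (func_raw is just a hypothetical rename):

def func_raw(row):
    # row is a numpy array here, e.g. array(['AG', 'GA'], dtype=object)
    gt_x, gt_y = row[0], row[1]
    if gt_x == gt_y or gt_x == gt_y[::-1]:
        return 2
    if set(gt_x) & set(gt_y):
        return 1
    return 0

merged_df['Decision'] = merged_df.apply(func_raw, axis=1, raw=True)

Since these columns hold strings rather than numbers, the gain comes mostly from skipping the per-row Series construction, so the speedup may be more modest than for a NumPy reduction.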

Upvotes: 1
