Ali
Ali

Reputation: 99

How to speed up pandas dataframe iteration

We use pandas dataframe in our project and we realized that our program is very slow due to pandas dataframe's slow calculations. I shared our code with you.

   df_item_in_desc = pd.DataFrame(columns = df.columns) # to hold all satisfied results
    
    for index in df.shape[0]:
        s1 = set(df.iloc[index]['desc_words_short'])
    
        if item_number in s1:   
            df_item_in_desc = df_item_in_desc.append(df.iloc[index])

We check that if item name is in another column desc_words_short then we append that row to another dataframe (df_item_in_desc). This is simple logic but to get such rows we should iterate over all dataframe and check that condition. Our dataframe is a bit large and running this code takes more time. How can we speed up this process, can we use Cpu parallelization in this task, or something else?

Note: We actually tried Cpu parallelization and wouldn't be successful.

Upvotes: 2

Views: 1498

Answers (2)

รยקคгรђשค
รยקคгรђשค

Reputation: 1979

You can also use list comprehension. We should avoid using df.apply and have it as a last resort.

On larger datasets, list comprehension will be faster. Benchmarks in answer here: link. The answer itself is a gem of wisdom.

Quoting the benchmark:

%timeit df[df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)]
%timeit df[[y.lower() in x.lower() for x, y in zip(df['Title'], df['Name'])]]

2.85 ms ± 38.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
788 µs ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
import pandas as pd
item_number = 'a'
df = pd.DataFrame({'desc_words_short':[['a','a','b'],['b','d'],['c','c']]})
df[[ item_number in x for x in df['desc_words_short']]]

Dataframe:

  desc_words_short
0        [a, a, b]
1           [b, d]
2           [c, c]

Output:

  desc_words_short
0        [a, a, b]

Reference: https://stackoverflow.com/a/54432584/6741053

Upvotes: 1

Derek Eden
Derek Eden

Reputation: 4618

so it looks like you're looping through each row and looking at the value of the desc_words_short column. And for each value, if that value (presumably a list) contains item_number then you want to add that row to df_item_in_desc.

If that is the goal, you may be able to speed it up like this:

import pandas as pd

item_number = 'a'
df = pd.DataFrame({'desc_words_short':[['a','a','b'],['b','d'],['c','c']]})

print(df)

  desc_words_short
0        [a, a, b]
1           [b, d]
2           [c, c]

mask = df['desc_words_short'].apply(lambda x: item_number in x)
df_item_in_desc = df.loc[mask]

print(df_item_in_desc)

  desc_words_short
0        [a, a, b]

I'm not sure what the point of set is, as item_number would be in either the full list or the set, so it's a pointless additional computation

Upvotes: 1

Related Questions