How to speed up pandas dataframe iteration

Question

We use pandas DataFrames in our project, and we realized that our program is very slow because of slow DataFrame calculations. Here is our code:

    df_item_in_desc = pd.DataFrame(columns=df.columns)  # holds all rows that satisfy the condition

    for index in range(df.shape[0]):
        s1 = set(df.iloc[index]['desc_words_short'])

        if item_number in s1:
            df_item_in_desc = df_item_in_desc.append(df.iloc[index])

We check whether item_number appears in the desc_words_short column; if it does, we append that row to another DataFrame (df_item_in_desc). The logic is simple, but to find such rows we have to iterate over the whole DataFrame and check the condition for each row. Our DataFrame is fairly large, and running this code takes a long time. How can we speed this up? Can we use CPU parallelization for this task, or something else?

Note: We did try CPU parallelization, but without success.

Derek Eden · Accepted Answer

It looks like you're looping through each row and looking at the value of the desc_words_short column. For each value (presumably a list), if it contains item_number, you want to add that row to df_item_in_desc.

If that is the goal, you may be able to speed it up like this:

    import pandas as pd

    item_number = 'a'
    df = pd.DataFrame({'desc_words_short':[['a','a','b'],['b','d'],['c','c']]})

    print(df)

      desc_words_short
    0        [a, a, b]
    1           [b, d]
    2           [c, c]

    mask = df['desc_words_short'].apply(lambda x: item_number in x)
    df_item_in_desc = df.loc[mask]

    print(df_item_in_desc)

      desc_words_short
    0        [a, a, b]

I'm not sure what the point of set is: item_number is found in the set exactly when it is found in the full list, so the conversion is a pointless additional computation.
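
To make the point about set concrete, here is a small sketch on the same toy data. It checks that the mask is identical with and without the set conversion, and shows two other ways to build the same mask (a list comprehension, and explode followed by a per-row any). The names mask_list, mask_set, mask_comp and mask_explode are just illustrative, and whether either variation is faster than apply on your data is something you would need to time yourself.

```python
import pandas as pd

item_number = 'a'
df = pd.DataFrame({'desc_words_short': [['a', 'a', 'b'], ['b', 'd'], ['c', 'c']]})

# Membership test directly on each list (same idea as the mask in the answer).
mask_list = df['desc_words_short'].apply(lambda words: item_number in words)

# Same test after converting each list to a set (as in the question's loop).
mask_set = df['desc_words_short'].apply(lambda words: item_number in set(words))

# Both masks select the same rows, so the set conversion does not change the result.
assert mask_list.tolist() == mask_set.tolist()

# Variation 1: build the mask with a plain list comprehension instead of apply.
mask_comp = pd.Series(
    [item_number in words for words in df['desc_words_short']],
    index=df.index,
)

# Variation 2: explode to one row per word, compare, then collapse back to
# one boolean per original row (the original index is repeated by explode).
exploded = df['desc_words_short'].explode()
mask_explode = exploded.eq(item_number).groupby(level=0).any()

# Select all matching rows in one step instead of appending row by row.
df_item_in_desc = df.loc[mask_comp]
print(df_item_in_desc)
```

Whichever mask you use, the main saving comes from selecting all matching rows at once with df.loc rather than growing a DataFrame inside a loop.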
