Reputation: 99
We use pandas DataFrames in our project, and we realized that our program is very slow because of slow DataFrame operations. Our code is below.
df_item_in_desc = pd.DataFrame(columns=df.columns)  # holds all rows that satisfy the condition
for index in range(df.shape[0]):
    s1 = set(df.iloc[index]['desc_words_short'])
    if item_number in s1:
        # Note: DataFrame.append was removed in pandas 2.0
        df_item_in_desc = df_item_in_desc.append(df.iloc[index])
We check whether the item name is in another column, desc_words_short, and if so we append that row to another dataframe (df_item_in_desc). The logic is simple, but to get such rows we have to iterate over the whole dataframe and check that condition. Our dataframe is fairly large, and running this code takes a long time. How can we speed up this process? Can we use CPU parallelization for this task, or something else?
Note: We actually tried CPU parallelization, without success.
Upvotes: 2
Views: 1498
Reputation: 1979
You can also use a list comprehension. df.apply should be avoided and kept as a last resort.
On larger datasets, a list comprehension will be faster. Benchmarks are in the answer linked under Reference below; that answer itself is a gem of wisdom.
Quoting the benchmark:
%timeit df[df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)]
%timeit df[[y.lower() in x.lower() for x, y in zip(df['Title'], df['Name'])]]
2.85 ms ± 38.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
788 µs ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
import pandas as pd
item_number = 'a'
df = pd.DataFrame({'desc_words_short':[['a','a','b'],['b','d'],['c','c']]})
df[[item_number in x for x in df['desc_words_short']]]
Dataframe:
desc_words_short
0 [a, a, b]
1 [b, d]
2 [c, c]
Output:
desc_words_short
0 [a, a, b]
Reference: https://stackoverflow.com/a/54432584/6741053
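To check the speed claim on your own data, here is a minimal timing sketch. It assumes a hypothetical larger frame (df_big, built by repeating the three example rows) and compares a Series.apply mask against the list comprehension. The helper names with_apply and with_list_comp are made up for illustration, and actual timings will depend on your machine and pandas version.

```python
import timeit

import pandas as pd

item_number = 'a'

# Hypothetical larger frame: the three example rows repeated 10,000 times.
df_big = pd.DataFrame(
    {'desc_words_short': [['a', 'a', 'b'], ['b', 'd'], ['c', 'c']] * 10_000}
)


def with_apply():
    # Boolean mask built with Series.apply
    return df_big[df_big['desc_words_short'].apply(lambda x: item_number in x)]


def with_list_comp():
    # Boolean mask built with a plain list comprehension
    return df_big[[item_number in x for x in df_big['desc_words_short']]]


result_apply = with_apply()
result_list_comp = with_list_comp()

# Both approaches should select exactly the same rows.
same_rows = result_apply.equals(result_list_comp)

# Total seconds for 20 runs of each approach.
t_apply = timeit.timeit(with_apply, number=20)
t_list_comp = timeit.timeit(with_list_comp, number=20)
print(f"same rows: {same_rows}")
print(f"apply: {t_apply:.4f} s, list comprehension: {t_list_comp:.4f} s")
```

Comparing the two results before timing them guards against benchmarking two approaches that do not actually return the same rows.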
Upvotes: 1
Reputation: 4618
So it looks like you're looping through each row and looking at the value of the desc_words_short column. For each value, if that value (presumably a list) contains item_number, you want to add that row to df_item_in_desc.
If that is the goal, you may be able to speed it up like this:
import pandas as pd
item_number = 'a'
df = pd.DataFrame({'desc_words_short':[['a','a','b'],['b','d'],['c','c']]})
print(df)
desc_words_short
0 [a, a, b]
1 [b, d]
2 [c, c]
mask = df['desc_words_short'].apply(lambda x: item_number in x)
df_item_in_desc = df.loc[mask]
print(df_item_in_desc)
desc_words_short
0 [a, a, b]
I'm not sure what the point of set is, as item_number would be in either the full list or the set, so it's a pointless additional computation.
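As a quick sanity check of that point, the sketch below builds the mask twice on the example data, once testing membership in the list directly and once converting each list to a set first, and confirms that the two masks are identical. The variable names mask_list and mask_set are only for illustration.

```python
import pandas as pd

item_number = 'a'
df = pd.DataFrame({'desc_words_short': [['a', 'a', 'b'], ['b', 'd'], ['c', 'c']]})

# Membership test directly on each list
mask_list = df['desc_words_short'].apply(lambda x: item_number in x)

# Membership test after converting each list to a set (extra work per row)
mask_set = df['desc_words_short'].apply(lambda x: item_number in set(x))

# Both masks select the same rows, so the conversion does not change the result.
print(mask_list.equals(mask_set))
df_item_in_desc = df.loc[mask_list]
print(df_item_in_desc)
```

Building a set only pays off when the same collection is queried many times; here each list is checked once, so the conversion is pure overhead.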
Upvotes: 1