Reputation: 304
UPDATE: I have edited the question (and code) to make the problem clearer. I use synthetic data here, but imagine a large df of floods and a small one of significant floods. I want to add a reference to every row (of the large df) if it is somewhat close to a significant flood.
I have 2 pandas dataframes (1 large and 1 small). In every iteration I want to create a subset of the small dataframe based on a few conditions that depend on the current row of the large df:
import numpy as np
import pandas as pd
import time
SOME_THRESHOLD = 10.5
NUMBER_OF_ROWS = int(2e4)
large_df = pd.DataFrame(index=np.arange(NUMBER_OF_ROWS), data={'a': np.arange(NUMBER_OF_ROWS)})
small_df = large_df.loc[np.random.randint(0, NUMBER_OF_ROWS, 5)]
large_df['similar_past_flood_tag'] = None
count_time = 0
for ind, row in large_df.iterrows():
    start = time.time()
    # This line takes forever.
    df_tmp = small_df[(small_df.index < ind) &
                      (small_df['a'] > (row['a'] - SOME_THRESHOLD)) &
                      (small_df['a'] < (row['a'] + SOME_THRESHOLD))]
    count_time += time.time() - start
    if not df_tmp.empty:
        past_index = df_tmp.loc[df_tmp.index.max()]['a']
        large_df.loc[ind, 'similar_past_flood_tag'] = f'Similar to the large flood of {past_index}'
print(f'The total time of creating the subset df for 2e4 rows is: {count_time} seconds.')
The line that creates the subset takes a long time to compute:
The total time of creating the subset df for 2e4 rows is: 18.276793956756592 seconds.
This seems far too long to me. I have found similar questions, but none of the answers seemed to help (e.g. query and numpy conditions). Is there a way to optimize this?
Note: the code does what is expected, it is just very slow.
Upvotes: 0
Views: 1370
Reputation: 2967
In pandas, for loops are much slower than vectorized column operations, so changing the calculation to loop over small_df (5 rows) instead of large_df (20,000 rows) already gives a big improvement:
for ind, row in small_df.iterrows():
    df_tmp = large_df[<some condition>]
    # ... some other processing
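For the tagging task in the question, a minimal sketch of that inverted loop could look like this (the column name and tag text are taken from the question's code; iterating the significant floods in ascending index order, so that the latest matching flood overwrites earlier ones, mirrors the df_tmp.index.max() choice in the original loop):
# Sketch: loop over the few significant floods instead of the 20K rows.
for s_ind, s_row in small_df.sort_index().iterrows():
    mask = ((large_df.index > s_ind) &
            (large_df['a'] > s_row['a'] - SOME_THRESHOLD) &
            (large_df['a'] < s_row['a'] + SOME_THRESHOLD))
    # Later (higher-index) floods overwrite earlier ones.
    large_df.loc[mask, 'similar_past_flood_tag'] = f"Similar to the large flood of {s_row['a']}"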
Even better for your case is to use a merge rather than filtering inside a loop. The problem is that you are not matching on equal values but on approximately equal ones. To make a merge possible, you can bucket the column into bins of width 2 * SOME_THRESHOLD and merge on the bin; doing it twice, with the second binning shifted by half a bin, also catches pairs that fall on either side of a bin boundary. Here's a hacky example.
small_df['a_rounded'] = (small_df['a'] / SOME_THRESHOLD / 2).astype(int)
large_df['a_rounded'] = (large_df['a'] / SOME_THRESHOLD / 2).astype(int)
merge_result = small_df.merge(large_df, on='a_rounded')
small_df['a_rounded2'] = ((small_df['a'] + SOME_THRESHOLD) / SOME_THRESHOLD / 2).astype(int)
large_df['a_rounded2'] = ((large_df['a'] + SOME_THRESHOLD) / SOME_THRESHOLD / 2).astype(int)
merge_result2 = small_df.merge(large_df, on='a_rounded2')
total_merge_result = pd.concat([merge_result, merge_result2])
# Now remove duplicates and impose additional filters.
You can then drop the duplicate pairs and re-apply the exact conditions (the index ordering and the precise threshold) on the combined result.
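A sketch of that clean-up step, under the assumption that the original indices are kept as columns before merging (e.g. via reset_index(), which is not part of the snippet above), so that the question's "earlier flood" condition can still be checked:
# Sketch: keep the original indices as columns that survive the merge (assumption).
small_r = small_df.reset_index().rename(columns={'index': 'small_ind'})
large_r = large_df.reset_index().rename(columns={'index': 'large_ind'})

merged = pd.concat([
    small_r.merge(large_r, on='a_rounded', suffixes=('_small', '_large')),
    small_r.merge(large_r, on='a_rounded2', suffixes=('_small', '_large')),
]).drop_duplicates(subset=['small_ind', 'large_ind'])

# Re-impose the exact conditions from the question.
merged = merged[(merged['small_ind'] < merged['large_ind']) &
                ((merged['a_large'] - merged['a_small']).abs() < SOME_THRESHOLD)]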
Upvotes: 2
Reputation: 2980
While your code is logically correct, building several boolean masks and slicing the DataFrame on every iteration adds up to quite some time: %timeit puts the subset line at roughly 0.7 ms per call, and multiplied by 20K rows that is indeed about 14 seconds. :)
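If you want to reproduce such a measurement, one rough way in IPython/Jupyter is to time the line for a single arbitrary row (ind = 10000 here is just an example value, not from the original post):
ind = 10000
row = large_df.loc[ind]
%timeit small_df[(small_df.index < ind) & (small_df['a'] > row['a'] - SOME_THRESHOLD) & (small_df['a'] < row['a'] + SOME_THRESHOLD)]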
What you could do is take advantage of numpy broadcasting to compute the boolean matrix for all rows at once, and then reconstruct the valid (large, small) index pairs. See below:
# Pull the index and value columns out as plain numpy arrays.
l_ind = np.array(large_df.index)
s_ind = np.array(small_df.index)
l_a = np.array(large_df.a)
s_a = np.array(small_df.a)

# Broadcast to a (len(large_df), len(small_df)) boolean matrix of matches.
arr1 = (l_ind[:, None] < s_ind[None, :])
arr2 = (((l_a[:, None] - SOME_THRESHOLD) < s_a[None, :]) &
        (s_a[None, :] < (l_a[:, None] + SOME_THRESHOLD)))
arr = arr1 & arr2

# Recover the (large, small) index pairs where all conditions hold.
large_valid_inds, small_valid_inds = np.where(arr)
result = pd.DataFrame({'large_ind': np.take(l_ind, large_valid_inds),
                       'small_ind': np.take(s_ind, small_valid_inds)})
result then contains the following DataFrame, which, if I understood the question properly, holds the expected matches:
  | large_ind | small_ind |
---|---|---|
0 | 1621 | 1631 |
1 | 1622 | 1631 |
2 | 1623 | 1631 |
3 | 1624 | 1631 |
4 | 1625 | 1631 |
5 | 1626 | 1631 |
6 | 1627 | 1631 |
7 | 1628 | 1631 |
8 | 1629 | 1631 |
9 | 1630 | 1631 |
10 | 1992 | 2002 |
11 | 1993 | 2002 |
12 | 1994 | 2002 |
13 | 1995 | 2002 |
14 | 1996 | 2002 |
15 | 1997 | 2002 |
16 | 1998 | 2002 |
17 | 1999 | 2002 |
18 | 2000 | 2002 |
19 | 2001 | 2002 |
20 | 8751 | 8761 |
21 | 8752 | 8761 |
22 | 8753 | 8761 |
23 | 8754 | 8761 |
24 | 8755 | 8761 |
25 | 8756 | 8761 |
26 | 8757 | 8761 |
27 | 8758 | 8761 |
28 | 8759 | 8761 |
29 | 8760 | 8761 |
30 | 10516 | 10526 |
31 | 10517 | 10526 |
32 | 10518 | 10526 |
33 | 10519 | 10526 |
34 | 10520 | 10526 |
35 | 10521 | 10526 |
36 | 10522 | 10526 |
37 | 10523 | 10526 |
38 | 10524 | 10526 |
39 | 10525 | 10526 |
40 | 18448 | 18458 |
41 | 18449 | 18458 |
42 | 18450 | 18458 |
43 | 18451 | 18458 |
44 | 18452 | 18458 |
45 | 18453 | 18458 |
46 | 18454 | 18458 |
47 | 18455 | 18458 |
48 | 18456 | 18458 |
49 | 18457 | 18458 |
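If you also need the question's tag column, one way to get it from the pair DataFrame is sketched below (result is the variable assigned above; keeping the matching flood with the largest index per large row mirrors the df_tmp.index.max() choice in the question):
# For each large row keep the matching significant flood with the largest index.
best = result.groupby('large_ind')['small_ind'].max()
a_of_small = small_df['a'].to_dict()  # small index -> flood value
large_df.loc[best.index, 'similar_past_flood_tag'] = (
    best.map(lambda s: f"Similar to the large flood of {a_of_small[s]}").values)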
Upvotes: 2