asher

Reputation: 304

Very slow filtering with multiple conditions in pandas dataframe

UPDATE: I have edited the question (and code) to make the problem clearer. I use synthetic data here, but imagine a large df of floods and a small one of significant floods. I want to add a reference to every row of the large_df if it is reasonably close to a significant flood.

I have 2 pandas dataframes (1 large and 1 small). In every iteration I want to create a subset of the small dataframe based on a few conditions that depend on the current row of the large df:

import numpy as np
import pandas as pd
import time

SOME_THRESHOLD = 10.5

NUMBER_OF_ROWS = 20_000
large_df = pd.DataFrame(index=np.arange(NUMBER_OF_ROWS), data={'a': np.arange(NUMBER_OF_ROWS)})
small_df = large_df.loc[np.random.randint(0, NUMBER_OF_ROWS, 5)]
large_df['similar_past_flood_tag'] = None

count_time = 0
for ind, row in large_df.iterrows():
    start = time.time()
    # This line takes forever.
    df_tmp = small_df[(small_df.index < ind) &
                      (small_df['a'] > (row['a'] - SOME_THRESHOLD)) &
                      (small_df['a'] < (row['a'] + SOME_THRESHOLD))]
    count_time += time.time() - start
    if not df_tmp.empty:
        # Tag the row with the most recent significant flood that is close enough.
        past_index = df_tmp.loc[df_tmp.index.max(), 'a']
        large_df.loc[ind, 'similar_past_flood_tag'] = f'Similar to the large flood of {past_index}'

print(f'The total time of creating the subset df for 2e4 rows is: {count_time} seconds.')

The line that creates the subset takes a long time to compute:

The total time of creating the subset df for 2e4 rows is: 18.276793956756592 seconds.

This seems too long to me. I have found similar questions, but none of the answers seemed to work (e.g. query and numpy conditions). Is there a way to optimize this?

Note: the code does what is expected - just very slow.

Upvotes: 0

Views: 1370

Answers (2)

ErnestScribbler

Reputation: 2967

In pandas, for loops are much slower than column operations, so changing the calculation to loop over small_df instead of large_df will already give a big improvement:

for ind, row in small_df.iterrows():
    df_tmp = large_df[ <some condition> ]
    # ... some other processing

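Concretely, that loop might look something like this (a sketch reusing the names from the question; close_enough is just an illustrative name, and small_df is iterated in index order so that, when a large_df row is close to several significant floods, the most recent one wins, as in the original loop):

for ind, row in small_df.sort_index().iterrows():
    # Rows of large_df that come after this significant flood and are within the threshold.
    close_enough = large_df[(large_df.index > ind) &
                            (large_df['a'] > row['a'] - SOME_THRESHOLD) &
                            (large_df['a'] < row['a'] + SOME_THRESHOLD)]
    large_df.loc[close_enough.index, 'similar_past_flood_tag'] = f'Similar to the large flood of {row["a"]}'

This runs the expensive filtering only len(small_df) times (5 here) instead of 20,000 times.
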
Even better for your case is to use a merge rather than a condition on large_df. The problem is that your merge is not on equal columns but on approximately equal ones. To use this approach, you can bin (truncate) your column and merge on the binned values. Here's a hacky example.

# Bin 'a' into windows of width 2*SOME_THRESHOLD; two values within the threshold
# fall into the same bin unless they straddle a bin boundary.
small_df['a_rounded'] = (small_df['a'] / SOME_THRESHOLD / 2).astype(int)
large_df['a_rounded'] = (large_df['a'] / SOME_THRESHOLD / 2).astype(int)
merge_result = small_df.merge(large_df, on='a_rounded')

# A second binning, shifted by half a window, catches the pairs that straddle
# a boundary of the first binning.
small_df['a_rounded2'] = ((small_df['a'] + SOME_THRESHOLD) / SOME_THRESHOLD / 2).astype(int)
large_df['a_rounded2'] = ((large_df['a'] + SOME_THRESHOLD) / SOME_THRESHOLD / 2).astype(int)
merge_result2 = small_df.merge(large_df, on='a_rounded2')

total_merge_result = pd.concat([merge_result, merge_result2])
# Now remove duplicates and impose the additional filters.

You can impose the additional filters on the result later.
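
For completeness, here is one way those follow-up steps could look (a sketch, not a definitive implementation; it re-exposes the original indices as columns via reset_index so the index condition from the question can still be applied after the merge, and the names small_r, large_r and pairs are just illustrative):

# Keep the original indices as columns, because merge() discards the index.
small_r = small_df.reset_index().rename(columns={'index': 'small_ind'})
large_r = large_df.reset_index().rename(columns={'index': 'large_ind'})

for df in (small_r, large_r):
    df['a_rounded'] = (df['a'] / SOME_THRESHOLD / 2).astype(int)
    df['a_rounded2'] = ((df['a'] + SOME_THRESHOLD) / SOME_THRESHOLD / 2).astype(int)

pairs = pd.concat([
    small_r.merge(large_r, on='a_rounded', suffixes=('_small', '_large')),
    small_r.merge(large_r, on='a_rounded2', suffixes=('_small', '_large')),
])

# Exact conditions from the question, then drop pairs found by both merges.
pairs = pairs[
    (pairs['small_ind'] < pairs['large_ind']) &
    ((pairs['a_large'] - pairs['a_small']).abs() < SOME_THRESHOLD)
].drop_duplicates(subset=['small_ind', 'large_ind'])

From pairs you can then pick, for each large_ind, the largest small_ind to recover the "most recent flood" tag from the question.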

Upvotes: 2

Zach Moshe

Reputation: 2980

While your code is logically correct, building the many boolean arrays and slicing the DataFrame on every iteration adds up to quite some time.

Here are some stats with %timeit:

  • (small_df.index<ind): ~30μs
  • (small_df['a']>(row['a']-SOME_THRESHOLD)): ~100μs
  • (small_df['a']<(row['a']+SOME_THRESHOLD)): ~100μs
  • After '&'-ing all three: ~500μs
  • Including the DataFrame slice: ~700μs

That, multiplied by 20K iterations, is indeed ~14 seconds.. :)

What you could do is take advantage of numpy's broadcasting to compute the boolean matrix more efficiently, and then reconstruct the "valid" DataFrame. See below:

l_ind = np.array(large_df.index)
s_ind = np.array(small_df.index)
l_a = np.array(large_df.a)
s_a = np.array(small_df.a)

arr1 = (l_ind[:, None] < s_ind[None, :])  # index condition, as a (large x small) boolean matrix
arr2 = (((l_a[:, None] - SOME_THRESHOLD) < s_a[None, :]) &
        (s_a[None, :] < (l_a[:, None] + SOME_THRESHOLD)))  # 'a' values within the threshold

arr = arr1 & arr2
large_valid_inds, small_valid_inds = np.where(arr)  # positions of all valid (large, small) pairs

pd.DataFrame({'large_ind': np.take(l_ind, large_valid_inds), 
              'small_ind': np.take(s_ind, small_valid_inds)})

That gives you the following DF which, if I understood the question properly, is the expected solution:

large_ind small_ind
0 1621 1631
1 1622 1631
2 1623 1631
3 1624 1631
4 1625 1631
5 1626 1631
6 1627 1631
7 1628 1631
8 1629 1631
9 1630 1631
10 1992 2002
11 1993 2002
12 1994 2002
13 1995 2002
14 1996 2002
15 1997 2002
16 1998 2002
17 1999 2002
18 2000 2002
19 2001 2002
20 8751 8761
21 8752 8761
22 8753 8761
23 8754 8761
24 8755 8761
25 8756 8761
26 8757 8761
27 8758 8761
28 8759 8761
29 8760 8761
30 10516 10526
31 10517 10526
32 10518 10526
33 10519 10526
34 10520 10526
35 10521 10526
36 10522 10526
37 10523 10526
38 10524 10526
39 10525 10526
40 18448 18458
41 18449 18458
42 18450 18458
43 18451 18458
44 18452 18458
45 18453 18458
46 18454 18458
47 18455 18458
48 18456 18458
49 18457 18458
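
If you then want the same 'similar_past_flood_tag' column as in the question, note that arr1 above keeps pairs where the large index comes before the small one, whereas the question's loop filters on small_df.index < ind, so you may want to flip that inequality to keep only past floods. With that, a minimal sketch for writing the tags back (assuming the DataFrame above is assigned to pairs and that small_df has unique index labels):

pairs = pd.DataFrame({'large_ind': np.take(l_ind, large_valid_inds),
                      'small_ind': np.take(s_ind, small_valid_inds)})

# Most recent significant flood per row of large_df, as in the original loop.
most_recent = pairs.groupby('large_ind')['small_ind'].max()
tags = small_df.loc[most_recent.values, 'a'].values
large_df.loc[most_recent.index, 'similar_past_flood_tag'] = [
    f'Similar to the large flood of {a}' for a in tags]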

Upvotes: 2
