asher

Reputation: 304

Very slow filtering with multiple conditions in pandas dataframe

UPDATE: I have edited the question (and code) to make the problem clearer. I use synthetic data here, but imagine a large df of floods and a small one of significant floods. I want to add a reference to every row of the large_df if it is reasonably close to a significant flood.

I have 2 pandas dataframes (1 large and 1 small). In every iteration I want to create a subset of the small dataframe based on a few conditions that depend on the current row of the large df:

import numpy as np
import pandas as pd
import time

SOME_THRESHOLD = 10.5

NUMBER_OF_ROWS = 20_000
large_df = pd.DataFrame(index=np.arange(NUMBER_OF_ROWS), data={'a': np.arange(NUMBER_OF_ROWS)})
small_df = large_df.loc[np.random.randint(0, NUMBER_OF_ROWS, 5)]
large_df['similar_past_flood_tag'] = None

count_time = 0
for ind, row in large_df.iterrows():
    start = time.time()
    # This line takes forever.
    df_tmp = small_df[(small_df.index < ind) &
                      (small_df['a'] > (row['a'] - SOME_THRESHOLD)) &
                      (small_df['a'] < (row['a'] + SOME_THRESHOLD))]
    count_time += time.time() - start
    if not df_tmp.empty:
        # Tag the row with the most recent significant flood that is close enough.
        past_index = df_tmp.loc[df_tmp.index.max(), 'a']
        large_df.loc[ind, 'similar_past_flood_tag'] = f'Similar to the large flood of {past_index}'

print(f'The total time of creating the subset df for 2e4 rows is: {count_time} seconds.')

The line that creates the subset takes a long time to compute:

The total time of creating the subset df for 2e4 rows is: 18.276793956756592 seconds.

This seems too long to me. I have found similar questions, but none of the answers seemed to work (e.g. query and numpy conditions). Is there a way to optimize this?

Note: the code does what is expected - just very slow.

Upvotes: 0

Views: 1370

Answers (2)

ErnestScribbler

Reputation: 2967

In pandas, for loops are much slower than column operations, so changing the calculation to loop over small_df instead of large_df will already give a big improvement:

for ind, row in small_df.iterrows():
    df_tmp = large_df[ <some condition> ]
    # ... some other processing

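Concretely, that loop might look something like this (a sketch reusing the names from the question; close_enough is just an illustrative name, and small_df is iterated in index order so that, when a large_df row is close to several significant floods, the most recent one wins, as in the original loop):

for ind, row in small_df.sort_index().iterrows():
    # Rows of large_df that come after this significant flood and are within the threshold.
    close_enough = large_df[(large_df.index > ind) &
                            (large_df['a'] > row['a'] - SOME_THRESHOLD) &
                            (large_df['a'] < row['a'] + SOME_THRESHOLD)]
    large_df.loc[close_enough.index, 'similar_past_flood_tag'] = f'Similar to the large flood of {row["a"]}'

This runs the expensive filtering only len(small_df) times (5 here) instead of 20,000 times.
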
Even better for your case is to use a merge rather than a condition on large_df. The problem is that your merge is not on equal columns but on approximately equal ones. To use this approach, you can bin (truncate) your column and merge on the binned values. Here's a hacky example.

# Bin 'a' into windows of width 2*SOME_THRESHOLD; two values within the threshold
# fall into the same bin unless they straddle a bin boundary.
small_df['a_rounded'] = (small_df['a'] / SOME_THRESHOLD / 2).astype(int)
large_df['a_rounded'] = (large_df['a'] / SOME_THRESHOLD / 2).astype(int)
merge_result = small_df.merge(large_df, on='a_rounded')

# A second binning, shifted by half a window, catches the pairs that straddle
# a boundary of the first binning.
small_df['a_rounded2'] = ((small_df['a'] + SOME_THRESHOLD) / SOME_THRESHOLD / 2).astype(int)
large_df['a_rounded2'] = ((large_df['a'] + SOME_THRESHOLD) / SOME_THRESHOLD / 2).astype(int)
merge_result2 = small_df.merge(large_df, on='a_rounded2')

total_merge_result = pd.concat([merge_result, merge_result2])
# Now remove duplicates and impose the additional filters.

You can impose the additional filters on the result later.
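
For completeness, here is one way those follow-up steps could look (a sketch, not a definitive implementation; it re-exposes the original indices as columns via reset_index so the index condition from the question can still be applied after the merge, and the names small_r, large_r and pairs are just illustrative):

# Keep the original indices as columns, because merge() discards the index.
small_r = small_df.reset_index().rename(columns={'index': 'small_ind'})
large_r = large_df.reset_index().rename(columns={'index': 'large_ind'})

for df in (small_r, large_r):
    df['a_rounded'] = (df['a'] / SOME_THRESHOLD / 2).astype(int)
    df['a_rounded2'] = ((df['a'] + SOME_THRESHOLD) / SOME_THRESHOLD / 2).astype(int)

pairs = pd.concat([
    small_r.merge(large_r, on='a_rounded', suffixes=('_small', '_large')),
    small_r.merge(large_r, on='a_rounded2', suffixes=('_small', '_large')),
])

# Exact conditions from the question, then drop pairs found by both merges.
pairs = pairs[
    (pairs['small_ind'] < pairs['large_ind']) &
    ((pairs['a_large'] - pairs['a_small']).abs() < SOME_THRESHOLD)
].drop_duplicates(subset=['small_ind', 'large_ind'])

From pairs you can then pick, for each large_ind, the largest small_ind to recover the "most recent flood" tag from the question.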

Upvotes: 2

Zach Moshe

Reputation: 2980

While your code is logically correct, building the many boolean arrays and slicing the DataFrame on every iteration adds up to quite some time.

Here are some stats with %timeit:

  • (small_df.index<ind): ~30μs
  • (small_df['a']>(row['a']-SOME_THRESHOLD)): ~100μs
  • (small_df['a']<(row['a']+SOME_THRESHOLD)): ~100μs
  • After '&'-ing all three: ~500μs
  • Including the DataFrame slice: ~700μs

That, multiplied by 20K iterations, is indeed ~14 seconds.. :)

What you could do is take advantage of numpy's broadcasting to compute the boolean matrix more efficiently, and then reconstruct the "valid" DataFrame. See below:

l_ind = np.array(large_df.index)
s_ind = np.array(small_df.index)
l_a = np.array(large_df.a)
s_a = np.array(small_df.a)

arr1 = (l_ind[:, None] < s_ind[None, :])  # index condition, as a (large x small) boolean matrix
arr2 = (((l_a[:, None] - SOME_THRESHOLD) < s_a[None, :]) &
        (s_a[None, :] < (l_a[:, None] + SOME_THRESHOLD)))  # 'a' values within the threshold

arr = arr1 & arr2
large_valid_inds, small_valid_inds = np.where(arr)  # positions of all valid (large, small) pairs

pd.DataFrame({'large_ind': np.take(l_ind, large_valid_inds), 
              'small_ind': np.take(s_ind, small_valid_inds)})

That gives you the following DF which, if I understood the question properly, is the expected solution:

large_ind small_ind
0 1621 1631
1 1622 1631
2 1623 1631
3 1624 1631
4 1625 1631
5 1626 1631
6 1627 1631
7 1628 1631
8 1629 1631
9 1630 1631
10 1992 2002
11 1993 2002
12 1994 2002
13 1995 2002
14 1996 2002
15 1997 2002
16 1998 2002
17 1999 2002
18 2000 2002
19 2001 2002
20 8751 8761
21 8752 8761
22 8753 8761
23 8754 8761
24 8755 8761
25 8756 8761
26 8757 8761
27 8758 8761
28 8759 8761
29 8760 8761
30 10516 10526
31 10517 10526
32 10518 10526
33 10519 10526
34 10520 10526
35 10521 10526
36 10522 10526
37 10523 10526
38 10524 10526
39 10525 10526
40 18448 18458
41 18449 18458
42 18450 18458
43 18451 18458
44 18452 18458
45 18453 18458
46 18454 18458
47 18455 18458
48 18456 18458
49 18457 18458
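
If you then want the same 'similar_past_flood_tag' column as in the question, note that arr1 above keeps pairs where the large index comes before the small one, whereas the question's loop filters on small_df.index < ind, so you may want to flip that inequality to keep only past floods. With that, a minimal sketch for writing the tags back (assuming the DataFrame above is assigned to pairs and that small_df has unique index labels):

pairs = pd.DataFrame({'large_ind': np.take(l_ind, large_valid_inds),
                      'small_ind': np.take(s_ind, small_valid_inds)})

# Most recent significant flood per row of large_df, as in the original loop.
most_recent = pairs.groupby('large_ind')['small_ind'].max()
tags = small_df.loc[most_recent.values, 'a'].values
large_df.loc[most_recent.index, 'similar_past_flood_tag'] = [
    f'Similar to the large flood of {a}' for a in tags]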

Upvotes: 2
