Teddy

Reputation: 645

Efficient way of filtering groupby data in a Panda DataFrame

Issue

I have two dataframes: Frame A (holding some raw data) and Frame B (holding threshold data).

My goal is, for each id in Frame B, to return the rows for the corresponding id in Frame A whose A value is <= that id's A value in Frame B (i.e. b['A']).

The code I wrote works fine; my issue is efficiency. Frame B has about 100K rows while Frame A has about 1M rows. Based on quick math, it would take ~6 hours to filter my data. While 6 hours may be feasible (I can have it run overnight), this dataset represents only a small sample, ~5%, of my total data.

What I have tried

import pandas as pd
from io import StringIO

data1 = """
id,A,B,C,D
123,2019-09-10 00:00:00,1,True,False
123,2019-09-10 00:10:00,1,True,False
123,2019-09-11 00:07:00,1,True,False
456,2019-09-05 01:00:00,1,True,False
456,2019-09-08 10:00:00,1,True,False
789,2019-09-10 10:00:00,1,True,False
789,2019-09-11 00:50:00,1,True,False
789,2019-09-12 12:00:00,1,True,False
"""

data2 = """
id,A,B
123,2019-09-10 00:10:00,1
456,2019-09-05 01:00:00,1
789,2019-09-12 12:00:00,1
"""

df_a = pd.read_csv(StringIO(data1))
df_b = pd.read_csv(StringIO(data2))

dfs = []
for Id in df_b.id.unique():
    df = df_a[df_a.id == Id]
    df = df[df['A'] <= df_b[df_b.id == Id]['A'].values[0]]
    dfs.append(df)

print(pd.concat(dfs))

Data

Desired Output

    id                    A  B     C      D
0  123  2019-09-10 00:00:00  1  True  False
1  123  2019-09-10 00:10:00  1  True  False
3  456  2019-09-05 01:00:00  1  True  False
5  789  2019-09-10 10:00:00  1  True  False
6  789  2019-09-11 00:50:00  1  True  False
7  789  2019-09-12 12:00:00  1  True  False

Frame A

   id                    A  B     C      D
0  123  2019-09-10 00:00:00  1  True  False
1  123  2019-09-10 00:10:00  1  True  False
2  123  2019-09-11 00:07:00  1  True  False
3  456  2019-09-05 01:00:00  1  True  False
4  456  2019-09-08 10:00:00  1  True  False
5  789  2019-09-10 10:00:00  1  True  False
6  789  2019-09-11 00:50:00  1  True  False
7  789  2019-09-12 12:00:00  1  True  False

Frame B

    id                    A  B
0  123  2019-09-10 00:10:00  1
1  456  2019-09-05 01:00:00  1
2  789  2019-09-12 12:00:00  1

Upvotes: 1

Views: 59

Answers (1)

Andy L.

Reputation: 25269

Using map

# Map each row's id to its threshold from df_b, then filter vectorised
s = df_a.id.map(dict(df_b[['id', 'A']].values))
df_a[df_a.A <= s]

Out[35]:
    id                   A  B     C      D
0  123 2019-09-10 00:00:00  1  True  False
1  123 2019-09-10 00:10:00  1  True  False
3  456 2019-09-05 01:00:00  1  True  False
5  789 2019-09-10 10:00:00  1  True  False
6  789 2019-09-11 00:50:00  1  True  False
7  789 2019-09-12 12:00:00  1  True  False
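For reference, a minimal self-contained sketch of this `map` approach (hypothetical two-id data mirroring the question's layout, with `A` parsed as datetimes):

```python
import pandas as pd

# Hypothetical frames mirroring the question's layout
df_a = pd.DataFrame({
    'id': [123, 123, 123, 456, 456],
    'A': pd.to_datetime(['2019-09-10 00:00:00', '2019-09-10 00:10:00',
                         '2019-09-11 00:07:00', '2019-09-05 01:00:00',
                         '2019-09-08 10:00:00']),
})
df_b = pd.DataFrame({
    'id': [123, 456],
    'A': pd.to_datetime(['2019-09-10 00:10:00', '2019-09-05 01:00:00']),
})

# Build a per-row threshold Series aligned to df_a, then filter once
thresholds = df_a['id'].map(dict(df_b[['id', 'A']].values))
result = df_a[df_a['A'] <= thresholds]
print(result)
```

This replaces the per-id Python loop with a single vectorised comparison, which is where the speedup over the original approach comes from.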

Use left merge and query

(df_a.merge(df_b[['id', 'A']], on='id', how='left', suffixes=('','_y'))
     .query('A <= A_y').drop(columns='A_y'))

Out[43]:
    id                   A  B     C      D
0  123 2019-09-10 00:00:00  1  True  False
1  123 2019-09-10 00:10:00  1  True  False
3  456 2019-09-05 01:00:00  1  True  False
5  789 2019-09-10 10:00:00  1  True  False
6  789 2019-09-11 00:50:00  1  True  False
7  789 2019-09-12 12:00:00  1  True  False
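The merge-and-query variant as a self-contained sketch (same hypothetical sample data; note `drop(columns=...)`, since the positional `axis` argument to `drop` was removed in pandas 2.0):

```python
import pandas as pd

# Hypothetical frames mirroring the question's layout
df_a = pd.DataFrame({
    'id': [123, 123, 456],
    'A': pd.to_datetime(['2019-09-10 00:00:00', '2019-09-11 00:07:00',
                         '2019-09-05 01:00:00']),
})
df_b = pd.DataFrame({
    'id': [123, 456],
    'A': pd.to_datetime(['2019-09-10 00:10:00', '2019-09-05 01:00:00']),
})

# Left-merge each id's threshold onto df_a, keep rows at or below it
merged = df_a.merge(df_b[['id', 'A']], on='id', how='left', suffixes=('', '_y'))
result = merged.query('A <= A_y').drop(columns='A_y')
print(result)
```

The merge briefly duplicates the threshold column across all of df_a's rows, so it uses more memory than the `map` version, but both avoid the per-id loop.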

Upvotes: 3
