Reputation: 645
Issue
I have two dataframe Frame A
(Holding some raw data) and Frame B
(holding threshold data).
My goal is for each id in Frame B
I want to return the rows for the corresponding id in Frame A
that are <= b['A']
.
The code I wrote works fine, my issue is with efficiency. Frame B has about 100K rows while Frame A has about 1M rows. Based on quick math it would take me ~6-hours to filter my data. While 6 hours may be feasible (I can have it run overnight) this dataset represents a small sample of my data ~5% of my total dataset.
What I have tried
import pandas as pd
from io import StringIO
data1 = """
id,A,B,C,D
123,2019-09-10 00:00:00,1,True,False
123,2019-09-10 00:10:00,1,True,False
123,2019-09-11 00:07:00,1,True,False
456,2019-09-05 01:00:00,1,True,False
456,2019-09-08 10:00:00,1,True,False
789,2019-09-10 10:00:00,1,True,False
789,2019-09-11 00:50:00,1,True,False
789,2019-09-12 12:00:00,1,True,False
"""
data2 = """
id,A,B
123,2019-09-10 00:10:00,1
456,2019-09-05 01:00:00,1
789,2019-09-12 12:00:00,1
"""
df_a = pd.read_csv(StringIO(data1))
df_b = pd.read_csv(StringIO(data2))
dfs = []
for Id in df_b.id.unique():
df = df_a[df_a.id == Id]
df = df[df['A'] <= df_b[df_b.id == Id]['A'].values[0]]
dfs.append(df)
print(pd.concat(dfs))
Data
Desired Output
id A B C D
0 123 2019-09-10 00:00:00 1 True False
1 123 2019-09-10 00:10:00 1 True False
3 456 2019-09-05 01:00:00 1 True False
5 789 2019-09-10 10:00:00 1 True False
6 789 2019-09-11 00:50:00 1 True False
7 789 2019-09-12 12:00:00 1 True False
Frame A
id A B C D
0 123 2019-09-10 00:00:00 1 True False
1 123 2019-09-10 00:10:00 1 True False
2 123 2019-09-11 00:07:00 1 True False
3 456 2019-09-05 01:00:00 1 True False
4 456 2019-09-08 10:00:00 1 True False
5 789 2019-09-10 10:00:00 1 True False
6 789 2019-09-11 00:50:00 1 True False
7 789 2019-09-12 12:00:00 1 True False
Frame B
id A B
0 123 2019-09-10 00:10:00 1
1 456 2019-09-05 01:00:00 1
2 789 2019-09-12 12:00:00 1
Upvotes: 1
Views: 59
Reputation: 25269
Using map
s = df_a.id.map(dict(df_b[['id', 'A']].values))
df_a[df_a.A <= s]
Out[35]:
id A B C D
0 123 2019-09-10 00:00:00 1 True False
1 123 2019-09-10 00:10:00 1 True False
3 456 2019-09-05 01:00:00 1 True False
5 789 2019-09-10 10:00:00 1 True False
6 789 2019-09-11 00:50:00 1 True False
7 789 2019-09-12 12:00:00 1 True False
Use left merge
and query
(df_a.merge(df_b[['id', 'A']], on='id', how='left', suffixes=('','_y'))
.query('A <= A_y').drop('A_y', 1))
Out[43]:
id A B C D
0 123 2019-09-10 00:00:00 1 True False
1 123 2019-09-10 00:10:00 1 True False
3 456 2019-09-05 01:00:00 1 True False
5 789 2019-09-10 10:00:00 1 True False
6 789 2019-09-11 00:50:00 1 True False
7 789 2019-09-12 12:00:00 1 True False
Upvotes: 3