Reputation: 53
I'm trying to merge two different DataFrames into a sorted one.
My input file can be represented like this:
import pandas as pd
rng = pd.date_range('05:37', periods= 5, freq='5Min')
df1 = pd.DataFrame({'key':['A','A','A','B','B'],'bus_code':['601','602','605','602','606'], 'time': rng})
df2 = pd.DataFrame({'bus_code':['601','602','603','604','605'],'distance':['0.1','0.5','1.2','1.9','2.5']})
print(df1)
print(df2)
key
in the column stands for different groups, so for each group, I want to full join them together with the route distance, thus the result should be like this:
bus_code key time distance
0 601 A 05:37:00 0.1
1 602 A 05:42:00 0.5
2 603 A NaT 1.2
3 604 A NaT 1.9
4 605 A 05:47:00 2.5
0 601 B NaT 0.1
1 602 B 05:52:00 0.5
2 603 B NaT 1.2
3 604 B NaT 1.9
4 605 B NaT 2.5
5 606 B 05:57:00 NaN
Below is what I've done to achieve this:
new_df = pd.DataFrame()
for name, group in df1.groupby('key'):
# print(group)
new_df = new_df.append(pd.merge(group, df2, left_on='bus_code', right_on='bus_code', how='outer').sort_values('distance'))
print(new_df)
But the result missed some key
value:
bus_code key time distance
0 601 A 05:37:00 0.1
1 602 A 05:42:00 0.5
2 605 A 05:47:00 2.5
3 603 NaN NaT 1.2
4 604 NaN NaT 1.9
0 602 B 05:52:00 0.5
1 601 NaN NaT 0.1
2 603 NaN NaT 1.2
3 604 NaN NaT 1.9
4 605 NaN NaT 2.5
5 606 B 05:57:00 NaN
So here is my question, how can I add back the missing key
value?
Upvotes: 0
Views: 170
Reputation: 402333
To address your original concern, you could use pd.Series.ffill
:
df.key.ffill(inplace=True)
print(df)
bus_code key time distance
0 601 A 05:37:00 0.1
1 602 A 05:42:00 0.5
2 605 A 05:47:00 2.5
3 603 A NaT 1.2
4 604 A NaT 1.9
0 602 B 05:52:00 0.5
1 601 B NaT 0.1
2 603 B NaT 1.2
3 604 B NaT 1.9
4 605 B NaT 2.5
5 606 B 05:57:00 NaN
As an alternative approach, you could use a groupby
and apply
operation instead:
g = df1.groupby('key', as_index=False)\
.apply(lambda x: x.merge(df2, on='bus_code', how='outer')\
.sort_values(['bus_code', 'key'])\
.set_index(['bus_code', 'time', 'distance']).bfill().reset_index())
.reset_index(drop=1)
print(g)
bus_code time distance key
0 601 2017-08-28 05:37:00 0.1 A
1 602 2017-08-28 05:42:00 0.5 A
2 603 NaT 1.2 A
3 604 NaT 1.9 A
4 605 2017-08-28 05:47:00 2.5 A
5 601 NaT 0.1 B
6 602 2017-08-28 05:52:00 0.5 B
7 603 NaT 1.2 B
8 604 NaT 1.9 B
9 605 NaT 2.5 B
10 606 2017-08-28 05:57:00 NaN B
Upvotes: 1