Ethan

Reputation: 53

Merge DataFrame based on groups

I'm trying to merge two different DataFrames into a sorted one.

My input file can be represented like this:

import pandas as pd

rng = pd.date_range('05:37', periods=5, freq='5Min')
df1 = pd.DataFrame({'key': ['A', 'A', 'A', 'B', 'B'],
                    'bus_code': ['601', '602', '605', '602', '606'],
                    'time': rng})
df2 = pd.DataFrame({'bus_code': ['601', '602', '603', '604', '605'],
                    'distance': ['0.1', '0.5', '1.2', '1.9', '2.5']})
print(df1)
print(df2)

The key column stands for different groups, so for each group I want to do a full join with the route distances. The result should look like this:

     bus_code key   time      distance
0      601    A   05:37:00      0.1
1      602    A   05:42:00      0.5
2      603    A        NaT      1.2
3      604    A        NaT      1.9
4      605    A   05:47:00      2.5

0      601    B        NaT      0.1
1      602    B   05:52:00      0.5   
2      603    B        NaT      1.2
3      604    B        NaT      1.9
4      605    B        NaT      2.5
5      606    B   05:57:00      NaN

Below is what I've done to achieve this:

new_df = pd.DataFrame()
for name, group in df1.groupby('key'):
    # print(group)
    new_df = new_df.append(pd.merge(group, df2, on='bus_code', how='outer')
                             .sort_values('distance'))

print(new_df)

But the result is missing some of the key values:

     bus_code key    time    distance
0      601    A   05:37:00      0.1
1      602    A   05:42:00      0.5
2      605    A   05:47:00      2.5
3      603  NaN        NaT      1.2
4      604  NaN        NaT      1.9
0      602    B   05:52:00      0.5
1      601  NaN        NaT      0.1
2      603  NaN        NaT      1.2
3      604  NaN        NaT      1.9
4      605  NaN        NaT      2.5
5      606    B   05:57:00      NaN

So here is my question: how can I add back the missing key values?

Upvotes: 0

Views: 170

Answers (1)

cs95

Reputation: 402333

To address your original concern, you could forward-fill the missing keys with pd.Series.ffill:

new_df['key'] = new_df['key'].ffill()   # propagate the last seen key downwards
print(new_df)

   bus_code key      time  distance
0       601   A  05:37:00       0.1
1       602   A  05:42:00       0.5
2       605   A  05:47:00       2.5
3       603   A       NaT       1.2
4       604   A       NaT       1.9
0       602   B  05:52:00       0.5
1       601   B       NaT       0.1
2       603   B       NaT       1.2
3       604   B       NaT       1.9
4       605   B       NaT       2.5
5       606   B  05:57:00       NaN
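
If you would rather fix this inside your original loop, here is a minimal sketch (not part of the original answer, and using pd.concat since DataFrame.append has been removed in recent pandas) that stamps the group name onto every merged block, so no key can go missing in the first place:

pieces = []
for name, group in df1.groupby('key'):
    # outer-join this group's rows against the full distance table
    merged = (pd.merge(group, df2, on='bus_code', how='outer')
                .sort_values('distance'))
    merged['key'] = name   # every row of this block belongs to the current group
    pieces.append(merged)

new_df = pd.concat(pieces)
print(new_df)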

As an alternative approach, you could use a groupby and apply operation:

g = (df1.groupby('key', as_index=False)
        .apply(lambda x: x.merge(df2, on='bus_code', how='outer')
                          .sort_values(['bus_code', 'key'])
                          .set_index(['bus_code', 'time', 'distance'])
                          .bfill()
                          .reset_index())
        .reset_index(drop=True))

print(g)

   bus_code                time distance key
0       601 2017-08-28 05:37:00      0.1   A
1       602 2017-08-28 05:42:00      0.5   A
2       603                 NaT      1.2   A
3       604                 NaT      1.9   A
4       605 2017-08-28 05:47:00      2.5   A
5       601                 NaT      0.1   B
6       602 2017-08-28 05:52:00      0.5   B
7       603                 NaT      1.2   B
8       604                 NaT      1.9   B
9       605                 NaT      2.5   B
10      606 2017-08-28 05:57:00      NaN   B 
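
For completeness, another way to get the same shape is to build the full (key, bus_code) grid first and join the observed times onto it, so nothing needs to be filled at all. This is a sketch, not part of the original answer, and it assumes pandas >= 1.2 for how='cross':

grid = (df1[['key']].drop_duplicates()
           .merge(df2, how='cross')                          # every key paired with every bus_code
           .merge(df1, on=['key', 'bus_code'], how='outer')  # attach times; keeps B's extra 606
           .sort_values(['key', 'distance']))

print(grid)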

Upvotes: 1
