Reputation: 8966
I have two pandas dataframes that I'm trying to combine into a single dataframe. Here's how I set them up:
import pandas as pd

a = {'date':['1/1/2015 00:00','1/1/2015 00:15','1/1/2015 00:30'], 'num':[1,2,3]}
b = {'date':['1/1/2015 01:15','1/1/2015 01:30','1/1/2015 01:45'], 'num':[4,5,6]}
dfa = pd.DataFrame(a)
dfb = pd.DataFrame(b)
dfa['date'] = dfa['date'].apply(pd.to_datetime)
dfb['date'] = dfb['date'].apply(pd.to_datetime)
I then find the earliest and latest time stamps from each, and create a new dataframe that starts as just a date series:
earliest = min(dfa['date'].min(), dfb['date'].min())
latest = max(dfa['date'].max(), dfb['date'].max())
date_range = pd.date_range(earliest, latest, freq='15min')
dfd = pd.DataFrame({'date':date_range})
I then want to merge them all into a single dataframe, with dfd as the base since it will contain all of the proper time stamps. So I merge dfd and dfa and all is good:
dfd = pd.merge(dfd, dfa, how = 'outer', on = 'date')
However, when I merge it with dfb the date series gets screwy and I can't figure out why.
dfd = pd.merge(dfd, dfb, how = 'outer', on = ['date','num'])
...yields:
date num
0 2015-01-01 00:00:00 1.0
1 2015-01-01 00:15:00 2.0
2 2015-01-01 00:30:00 3.0
3 2015-01-01 00:45:00 NaN
4 2015-01-01 01:00:00 NaN
5 2015-01-01 01:15:00 NaN
6 2015-01-01 01:30:00 NaN
7 2015-01-01 01:45:00 NaN
8 2015-01-01 01:15:00 4.0
9 2015-01-01 01:30:00 5.0
10 2015-01-01 01:45:00 6.0
Where I would expect 4.0 to fill in the 2015-01-01 01:15:00 time slot, etc., and not create new rows.
Or if I try:
dfd = pd.merge(dfd, dfb, how = 'outer', on = 'date')
I get:
date num_x num_y
0 2015-01-01 00:00:00 1.0 NaN
1 2015-01-01 00:15:00 2.0 NaN
2 2015-01-01 00:30:00 3.0 NaN
3 2015-01-01 00:45:00 NaN NaN
4 2015-01-01 01:00:00 NaN NaN
5 2015-01-01 01:15:00 NaN 4.0
6 2015-01-01 01:30:00 NaN 5.0
7 2015-01-01 01:45:00 NaN 6.0
which is also not what I want (I just want a single num column). Any help would be appreciated.
Upvotes: 2
Views: 261
Reputation: 38415
This works. Merge dfa and dfb with each other first, then merge the result onto dfd:
a = {'date':['1/1/2015 00:00','1/1/2015 00:15','1/1/2015 00:30'], 'num':[1,2,3]}
b = {'date':['1/1/2015 01:15','1/1/2015 01:30','1/1/2015 01:45'], 'num':[4,5,6]}
dfa = pd.DataFrame(a)
dfb = pd.DataFrame(b)
dfa['date'] = dfa['date'].apply(pd.to_datetime)
dfb['date'] = dfb['date'].apply(pd.to_datetime)
earliest = min(dfa['date'].min(), dfb['date'].min())
latest = max(dfa['date'].max(), dfb['date'].max())
date_range = pd.date_range(earliest, latest, freq='15min')
dfd = pd.DataFrame({'date':date_range})
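# Combine the two data frames first; with no `on`, merge uses the shared columns ('date' and 'num')
df_dates = pd.merge(dfa, dfb, how = 'outer')
# Then merge onto the full date range; here the only shared column is 'date'
df_final = pd.merge(dfd, df_dates, how = 'outer')
df_final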
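As for why the original attempt added rows: after the first merge, dfd already has a num column, and it holds NaN at 01:15, 01:30 and 01:45. Merging on ['date','num'] then needs both columns to match, and (01:15, NaN) does not match (01:15, 4.0), so the outer merge keeps both rows. Here is a small sketch to check that, assuming the frames from the question. The names step1 and check are only for illustration, and indicator=True adds a _merge column saying which side each row came from:

```python
import pandas as pd

# Same data as in the question, built directly with datetime values
dfa = pd.DataFrame({'date': pd.to_datetime(['1/1/2015 00:00', '1/1/2015 00:15', '1/1/2015 00:30']),
                    'num': [1, 2, 3]})
dfb = pd.DataFrame({'date': pd.to_datetime(['1/1/2015 01:15', '1/1/2015 01:30', '1/1/2015 01:45']),
                    'num': [4, 5, 6]})
dfd = pd.DataFrame({'date': pd.date_range(dfa['date'].min(), dfb['date'].max(), freq='15min')})

# First merge: dfd picks up a num column with NaN wherever dfa has no row
step1 = pd.merge(dfd, dfa, how='outer', on='date')

# Second merge on both keys: the _merge column shows whether any rows matched
check = pd.merge(step1, dfb, how='outer', on=['date', 'num'], indicator=True)
print(check['_merge'].value_counts())
```

No row should come out as both, which is why the dfb rows get appended instead of filling the existing slots.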
Upvotes: 0
Reputation: 294258
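Set date as the index, combine the two frames, then use asfreq to fill in the missing 15-minute slots:
dfa.set_index('date').combine_first(dfb.set_index('date')) \
.asfreq('15min').reset_index()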
date num
0 2015-01-01 00:00:00 1.00
1 2015-01-01 00:15:00 2.00
2 2015-01-01 00:30:00 3.00
3 2015-01-01 00:45:00 nan
4 2015-01-01 01:00:00 nan
5 2015-01-01 01:15:00 4.00
6 2015-01-01 01:30:00 5.00
7 2015-01-01 01:45:00 6.00
Another solution (DataFrame.append was removed in pandas 2.0, so this uses pd.concat):
pd.concat([dfa, dfb]).set_index('date').asfreq('15min').reset_index()
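For anyone who wants to run both variants directly, here is a self-contained sketch. It assumes a recent pandas, where pd.concat stands in for the removed DataFrame.append and '15min' is the frequency alias; the names combined and stacked are just for illustration:

```python
import pandas as pd

# Same data as in the question
dfa = pd.DataFrame({'date': pd.to_datetime(['1/1/2015 00:00', '1/1/2015 00:15', '1/1/2015 00:30']),
                    'num': [1, 2, 3]})
dfb = pd.DataFrame({'date': pd.to_datetime(['1/1/2015 01:15', '1/1/2015 01:30', '1/1/2015 01:45']),
                    'num': [4, 5, 6]})

# Variant 1: align on the date index, take dfa's values first, then dfb's
combined = (dfa.set_index('date')
               .combine_first(dfb.set_index('date'))
               .asfreq('15min')      # inserts the missing 15-minute slots as NaN
               .reset_index())

# Variant 2: stack the rows, then regularise the index the same way
stacked = (pd.concat([dfa, dfb])
             .set_index('date')
             .asfreq('15min')
             .reset_index())

print(combined)
print(stacked)
```

Note that asfreq needs the date index to be unique, which holds here because dfa and dfb do not share any time stamps.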
Upvotes: 2
Reputation: 2620
Merge dfa and dfb first:
d = pd.merge(dfa, dfb, on=['date','num'], how='outer')
Then combine the result with dfd as you defined:
result = pd.merge(d, dfd, on='date', how='outer')
print(result.sort_values('date'))
Output:
date num
0 2015-01-01 00:00:00 1.0
1 2015-01-01 00:15:00 2.0
2 2015-01-01 00:30:00 3.0
6 2015-01-01 00:45:00 NaN
7 2015-01-01 01:00:00 NaN
3 2015-01-01 01:15:00 4.0
4 2015-01-01 01:30:00 5.0
5 2015-01-01 01:45:00 6.0
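If the out-of-order index labels in that output are a nuisance, one option is to sort by date and then reset the index. A minimal sketch, assuming the same dfa, dfb and dfd as in the question and a pandas version that has sort_values:

```python
import pandas as pd

# Same data as in the question
dfa = pd.DataFrame({'date': pd.to_datetime(['1/1/2015 00:00', '1/1/2015 00:15', '1/1/2015 00:30']),
                    'num': [1, 2, 3]})
dfb = pd.DataFrame({'date': pd.to_datetime(['1/1/2015 01:15', '1/1/2015 01:30', '1/1/2015 01:45']),
                    'num': [4, 5, 6]})
dfd = pd.DataFrame({'date': pd.date_range(dfa['date'].min(), dfb['date'].max(), freq='15min')})

# Merge the two data frames first, then bring in the full date range
d = pd.merge(dfa, dfb, on=['date', 'num'], how='outer')
result = pd.merge(d, dfd, on='date', how='outer')

# Sort chronologically and replace the index with a clean 0..n-1 range
result = result.sort_values('date').reset_index(drop=True)
print(result)
```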
Upvotes: 1