Reputation: 317
I have some sparse higher frequency data (unevenly spaced) and some low frequency data (daily).
How can I join the two so that each high frequency row gets the low frequency columns for the day it falls on?
One way would be to create a custom apply function and check each datum's YMD and look up the corresponding low frequency data, but that seems pretty inefficient.
Here's an example DataFrame which demonstrates the problem:
from pandas import DataFrame, date_range, merge
df1 = DataFrame(dict(date1=date_range(start='20100101', periods=48, freq='H'), value1=range(48)))
df2 = DataFrame(dict(date2=date_range(start='20100101', periods=2, freq='D'), value2=range(2)))
I've tried pd.merge and join, but the keys only match at midnight, so every other row comes back as NaT/NaN:
merge(df1,df2,left_on='date1',right_on='date2',how='outer')
date1 value1 date2 value2
0 2010-01-01 00:00:00 0 2010-01-01 0
1 2010-01-01 01:00:00 1 NaT NaN
2 2010-01-01 02:00:00 2 NaT NaN
3 2010-01-01 03:00:00 3 NaT NaN
...
24 2010-01-02 00:00:00 24 2010-01-02 1
25 2010-01-02 01:00:00 25 NaT NaN
...
30 2010-01-02 06:00:00 30 NaT NaN
31 2010-01-02 07:00:00 31 NaT NaN
The output I'm hoping for should have value2 be 0 for everything on the 1st and 1 for everything on the 2nd:
date1 value1 date2 value2
0 2010-01-01 00:00:00 0 2010-01-01 0
1 2010-01-01 01:00:00 1 2010-01-01 0
2 2010-01-01 02:00:00 2 2010-01-01 0
3 2010-01-01 03:00:00 3 2010-01-01 0
...
29 2010-01-02 05:00:00 29 2010-01-02 1
30 2010-01-02 06:00:00 30 2010-01-02 1
31 2010-01-02 07:00:00 31 2010-01-02 1
Upvotes: 2
Views: 1543
Reputation: 375915
Note: you can do this very cleanly with a merge. Add a date2 column to df1 holding the start of each day and merge on it (this assumes date2 is the only column the two frames share):
In [41]: df1['date2'] = pd.DatetimeIndex(df1['date1']).normalize()
In [42]: pd.merge(df1, df2).head()
Out[42]:
date1 value1 date2 value2
0 2010-01-01 00:00:00 0 2010-01-01 0
1 2010-01-01 01:00:00 1 2010-01-01 0
2 2010-01-01 02:00:00 2 2010-01-01 0
3 2010-01-01 03:00:00 3 2010-01-01 0
4 2010-01-01 04:00:00 4 2010-01-01 0
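For reference, here is a self-contained sketch of the same merge that can be run as a script. It assumes a pandas version with the .dt accessor (0.15 or later). The names keyed and merged are only illustrative, and the how="left" choice is my own assumption that you want to keep high frequency rows even when a day is missing from df2.

```python
import pandas as pd

# Rebuild the example frames from the question.
# A Timedelta is used for the hourly step so the snippet does not depend on
# a particular spelling of the hourly frequency alias.
df1 = pd.DataFrame({
    "date1": pd.date_range(start="2010-01-01", periods=48, freq=pd.Timedelta(hours=1)),
    "value1": range(48),
})
df2 = pd.DataFrame({
    "date2": pd.date_range(start="2010-01-01", periods=2, freq="D"),
    "value2": range(2),
})

# Truncate each timestamp to midnight so it can match the daily key.
keyed = df1.assign(date2=df1["date1"].dt.normalize())

# A left join keeps every high frequency row, even on days missing from df2.
merged = keyed.merge(df2, on="date2", how="left")

print(merged.head())
```

Naming the key explicitly with on="date2" avoids relying on merge's default of joining on every shared column.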
My original answer, which I thought might be more efficient, was to do this with a reindex.
Just to make things easier let's set date2 as the index:
In [11]: df2 = df2.set_index('date2')
Now reindex on the start of each day (computed with normalize; from 0.15 onward you'll be able to use .dt.normalize()):
In [12]: pd.DatetimeIndex(df1.date1).normalize()
Out[12]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2010-01-01, ..., 2010-01-02]
Length: 48, Freq: None, Timezone: None
In [13]: df2.reindex(pd.DatetimeIndex(df1.date1).normalize()).head()
Out[13]:
value2
2010-01-01 0
2010-01-01 0
2010-01-01 0
2010-01-01 0
2010-01-01 0
You have to assign the .values, otherwise pandas will try to align the result's date index with df1's integer index:
In [14]: df1['value2'] = df2.reindex(pd.DatetimeIndex(df1.date1).normalize()).values
In [15]: df1.head()
Out[15]:
date1 value1 value2
0 2010-01-01 00:00:00 0 0
1 2010-01-01 01:00:00 1 0
2 2010-01-01 02:00:00 2 0
3 2010-01-01 03:00:00 3 0
4 2010-01-01 04:00:00 4 0
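The reindex steps above can also be put together as one script. This is a minimal sketch, assuming a pandas version that has both .dt.normalize() and to_numpy() (0.24 or later). The names daily and days are only illustrative, and selecting the value2 column as a Series is my own simplification so that the assigned array is one-dimensional.

```python
import pandas as pd

# Rebuild the example frames from the question.
df1 = pd.DataFrame({
    "date1": pd.date_range(start="2010-01-01", periods=48, freq=pd.Timedelta(hours=1)),
    "value1": range(48),
})
df2 = pd.DataFrame({
    "date2": pd.date_range(start="2010-01-01", periods=2, freq="D"),
    "value2": range(2),
})

# Daily values as a Series indexed by date (the index must be unique).
daily = df2.set_index("date2")["value2"]

# One midnight timestamp per high frequency row.
days = df1["date1"].dt.normalize()

# reindex looks up the value for each day (repeated days are fine in the target).
# to_numpy() drops the date index so the assignment is positional instead of
# being aligned on labels.
df1["value2"] = daily.reindex(days).to_numpy()

print(df1.head())
```

Any day that is missing from df2 would come back as NaN here, which also turns the column into floats. As far as I know, days.map(daily) performs an equivalent lookup.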
Upvotes: 2