Reputation: 13
I have a DataFrame "timeseries" which has datetimes as its index, and I have a PeriodIndex "on":
import numpy as np
import pandas as pd
timeseries = pd.DataFrame(
    index=pd.DatetimeIndex(
        [
            "2000-01-01 12:00:00Z",
            "2000-01-01 13:00:00Z",
            "2000-01-01 14:00:00Z",
            "2000-01-02 13:00:00Z",
            "2000-01-02 18:00:00Z",
            "2000-01-03 14:00:00Z",
            "2000-01-03 20:00:00Z",
            "2000-01-04 13:00:00Z",
        ]
    ),
    data={
        "value1": [6.0, 5.0, 3.0, 7.0, 4.0, 4.0, 5.0, 3.0],
    },
)
on = pd.PeriodIndex(
    ["2000-01-01", "2000-01-02", "2000-01-04", "2000-01-05"], freq="D"
)
I would like to add a column to "timeseries" that contains the period in "on" that each respective datetime is in:
                           value1      period
2000-01-01 12:00:00+00:00     6.0  2000-01-01
2000-01-01 13:00:00+00:00     5.0  2000-01-01
2000-01-01 14:00:00+00:00     3.0  2000-01-01
2000-01-02 13:00:00+00:00     7.0  2000-01-02
2000-01-02 18:00:00+00:00     4.0  2000-01-02
2000-01-03 14:00:00+00:00     4.0         NaN
2000-01-03 20:00:00+00:00     5.0         NaN
2000-01-04 13:00:00+00:00     3.0  2000-01-04
So far I have achieved this with a for-loop:
timeseries["period"] = np.NaN
for period in on:
datetimes_in_period = timeseries.index[
(timeseries.index >= period.start_time.tz_localize("UTC"))
& (timeseries.index <= period.end_time.tz_localize("UTC"))
]
timeseries["period"].loc[datetimes_in_period] = period
For efficiency's sake I want to avoid loops in Python. How can I vectorize this code?
Upvotes: 1
Views: 58
Reputation: 21580
You can still use merge_asof, as @rorshan suggested.
If you create a dataframe of the start/end intervals:
df_on = pd.DataFrame({
    "period": on,
    "start_time": on.start_time.tz_localize("UTC"),
    "end_time": on.end_time.tz_localize("UTC"),
})
df = pd.merge_asof(timeseries, df_on, left_index=True, right_on="start_time")

# blank out "period" where the timestamp is not inside the matched interval
df["period"] = df["period"].where(
    (df.index >= df["start_time"]) & (df.index <= df["end_time"])
)
>>> df[["value1", "period"]]
value1 period
2000-01-01 12:00:00+00:00 6.0 2000-01-01
2000-01-01 13:00:00+00:00 5.0 2000-01-01
2000-01-01 14:00:00+00:00 3.0 2000-01-01
2000-01-02 13:00:00+00:00 7.0 2000-01-02
2000-01-02 18:00:00+00:00 4.0 2000-01-02
2000-01-03 14:00:00+00:00 4.0 NaT
2000-01-03 20:00:00+00:00 5.0 NaT
2000-01-04 13:00:00+00:00 3.0 2000-01-04
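For the daily periods in this example there is also a shorter route that skips the interval frame entirely: convert the index to periods of the same frequency and mask out the ones that do not occur in on. A minimal sketch, assuming the index is UTC and on holds calendar-day periods as in the question:
# Drop the timezone first, since to_period() discards it (with a warning).
periods = timeseries.index.tz_localize(None).to_period("D")

# Keep a period only where it actually occurs in `on`; the rest become NaT.
timeseries["period"] = periods.where(periods.isin(on))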
Upvotes: 0
Reputation: 166
Note: this answer originally referred to a version of the question which has since been edited to change its intent.
Here's my solution:
import pandas as pd
timeseries = pd.DataFrame(
    index=pd.DatetimeIndex(
        [
            "2000-01-01 12:00:00Z",
            "2000-01-01 13:00:00Z",
            "2000-01-01 14:00:00Z",
            "2000-01-02 13:00:00Z",
            "2000-01-02 18:00:00Z",
            "2000-01-03 14:00:00Z",
            "2000-01-03 20:00:00Z",
            "2000-01-04 13:00:00Z",
        ]
    ),
    data={
        "value1": [6.0, 5.0, 3.0, 7.0, 4.0, 4.0, 5.0, 3.0],
    },
)
on = pd.PeriodIndex(
    ["2000-01-01", "2000-01-02", "2000-01-04", "2000-01-05"], freq="D"
)
merge = (
    pd.merge_asof(
        timeseries.index.to_frame(),
        on.to_timestamp().to_frame(),
        right_index=True,
        left_index=True,
    )
    .drop('0_x', axis=1)
    .reset_index()
    .rename({'0_y': 'date', 'index': 'period'}, axis=1)
)
# extract from `on` any date which does not have a matching date in `timeseries`
unmatched_periods = on.to_timestamp().difference(merge.date).to_frame()
unmatched_periods[0] = pd.NaT
merge = merge.groupby('date').agg(func=lambda x: list(x))
unmatched_periods.columns = merge.columns
merge = pd.concat((merge, unmatched_periods))
merge
I have never used PeriodIndex before, and was forced to convert it to a DatetimeIndex with to_timestamp. From looking at the documentation it seems that PeriodIndex is intended to create dates/periods programmatically (e.g. every X days between two dates), which doesn't seem to be quite what it's being used for here.
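For illustration, that programmatic use would look something like pd.period_range, which builds a PeriodIndex between two endpoints (a hypothetical example, not taken from the question):
# A daily PeriodIndex covering every day between two endpoints.
daily = pd.period_range(start="2000-01-01", end="2000-01-05", freq="D")

# Converting back to timestamps, as done above, gives the start of each period.
daily.to_timestamp()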
Anyway, the core of the solution is to use merge_asof, which is like merge but instead of requiring equal keys, it looks for the closest key. By default it looks backwards, which is what we want (the closest date in on which is before the date in timeseries).
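A minimal illustration of that backward matching, using small hypothetical frames left and right (not part of the original answer):
left = pd.DataFrame({"t": pd.to_datetime(["2000-01-01 12:00", "2000-01-03 14:00"])})
right = pd.DataFrame(
    {"t": pd.to_datetime(["2000-01-01", "2000-01-02", "2000-01-04"]), "label": ["a", "b", "c"]}
)

# Each left key matches the closest right key at or before it:
# 2000-01-01 12:00 -> "a", 2000-01-03 14:00 -> "b" (2000-01-04 lies in the future).
pd.merge_asof(left, right, on="t")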
Then we use groupby and agg to get the groups. We also need to get the dates in on which did not have any match in timeseries (in this case 2000-01-05).
Note: you say you "avoid" loops for efficiency. In theory that's a good idea, but be careful that the result you're trying to achieve (having lists as values in a column) is by itself quite inefficient, on top of pandas groupby
also being fairly demanding.
Upvotes: 0