Reputation: 8954
I have a csv file that looks like this when I load it:
# generate example data
users = ['A', 'B', 'C', 'D']
#dates = pd.date_range("2020-02-01 00:00:00", "2020-04-04 20:00:00", freq="H")
dates = pd.date_range("2020-02-01 00:00:00", "2020-02-04 20:00:00", freq="H")
idx = pd.MultiIndex.from_product([users, dates])
idx.names = ["user", "datehour"]
y = pd.Series(np.random.choice(a=[0, 1], size=len(idx)), index=idx).rename('y')
# write to csv and reload (turns out this matters)
y.to_csv('reprod_example.csv')
y = pd.read_csv('reprod_example.csv', parse_dates=['datehour'])
y = y.set_index(['user', 'datehour']).y
>>> y.head()
user datehour
A 2020-02-01 00:00:00 0
2020-02-01 01:00:00 0
2020-02-01 02:00:00 1
2020-02-01 03:00:00 0
2020-02-01 04:00:00 0
Name: y, dtype: int64
I have the following function to create a lagged feature of an index level:
def shift_index(a, dt_idx_name, lag_freq, lag):
# get datetime index of relevant level
ac = a.copy()
dti = ac.index.get_level_values(dt_idx_name)
# shift it
dti_shifted = dti.shift(lag, freq=lag_freq)
# put it back where you found it
ac.index.set_levels(dti_shifted, level=dt_idx_name, inplace=True)
return ac
But when I run:
y_lag = shift_index(y, 'datehour', 'H', 1)
, I get the following error:
ValueError: Level values must be unique...
(I can actually suppress this error by adding verify_integrity=False
in .index.set_levels...
in the function, but that (predictably) causes problems down the line)
Here's the weird part. If you run the example above but without saving/reloading from csv, it works. The reason seems to be, I think, that y.index.get_level_value('datehour')
shows a freq='H'
attribute right after it's created, but freq=None
once its reloaded from csv.
That makes sense, csv obviously doesn't save that metadata. But I've found it surprisingly difficult to set the freq attribute for a MultiIndexed series. For example this did nothing.
df.index.freq = pd.tseries.frequencies.to_offset("H")
. And this answer also didn't work for my MultiIndex.
So I think I could solve this if I were able to set the freq
attribute of the DateTime component of my MultiIndex. But my ultimate goal is to be create a version of my y
data with shifted DateTime MultiIndex component, such as with my shift_index
function above. Since I receive my data via csv, "just don't save to csv and reload" is not an option.
Upvotes: 2
Views: 738
Reputation: 2748
After much fidgeting, I was able to set an hourly frequency using asfreq('H')
on grouped data, such that each group has unique values for the datehour
index.
y = pd.read_csv('reprod_example.csv', parse_dates=['datehour'])
y = y.groupby('user').apply(lambda df: df.set_index('datehour').asfreq('H')).y
Peeking at an index value shows the correct frequency.
y.index[0]
# ('A', Timestamp('2020-02-01 00:00:00', freq='H'))
All this is doing is setting the index in two parts. The user
goes first so that the nested datehour
index can be unique within it. Once the datehour
index is unique, then asfreq
can be used without difficulty.
If you try asfreq
on a non-unique index, it will not work.
y_load.set_index('datehour').asfreq('H')
# ---------------------------------------------------------------------------
# ValueError Traceback (most recent call last)
# <ipython-input-433-3ba51b619417> in <module>
# ----> 1 y_load.set_index('datehour').asfreq('H')
# ...
# ValueError: cannot reindex from a duplicate axis
Upvotes: 2