Reputation: 6449
I have a daily temperature dataset and I am trying to build a model that operates on one week of data at a time. I've imported it into a pandas DataFrame and grouped it by week (using the resample method). So far so good.
Please note, I do not want to aggregate the weekly data, I just want to group my "flat" dataset into weekly "chunks" that I can feed into the model one at a time.
I was able to accomplish it with the below code, but my question is:
How can I split this grouped DataFrame into training/validation sets?
Here is what I've tried so far (and mostly failed):
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
daily = pd.DataFrame(
    data=np.random.rand(365) * 120, columns=["temp"],
    index=pd.date_range(start="2019-01-01", end="2019-12-31", freq="D")
)
print("days:", len(daily))
weekly = daily.resample("W")
print("weeks:", len(weekly))
mask = np.random.rand(len(weekly)) < .8
# Both of these give KeyError: 'Columns not found: False, True'
train = weekly[mask]
valid = weekly[~mask]
# This also fails with KeyError: 'Columns not found: 12'
train, valid = train_test_split(weekly, train_size=.8)
UPDATE:
In the meantime, I came up with a pair of generators I can use for training/validation:
def gen_train(df, mask):
    # Yield only the weekly chunks whose mask entry is True
    for index, (_, data) in enumerate(df):
        if mask[index]:
            yield data
def gen_valid(df, mask):
    # Yield only the weekly chunks whose mask entry is False
    for index, (_, data) in enumerate(df):
        if not mask[index]:
            yield data
mask = np.random.rand(len(weekly)) < .8
model.fit(x=gen_train(weekly, mask), validation_data=gen_valid(weekly, mask),
    ...
)
Unfortunately, this doesn't shuffle the data.
Can anyone come up with a better solution?
Upvotes: 2
Views: 2080
Reputation: 711
from itertools import compress
train = compress(weekly, mask)
valid = compress(weekly, ~mask)
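To spell out what compress does here: iterating a Resampler yields (label, DataFrame) pairs, and compress keeps the pairs whose mask entry is truthy. Below is a minimal sketch, assuming the same kind of random daily data as in the question. The names pairs, train and valid and the fixed seeds are my own choices for illustration. Since compress returns a one-shot iterator, I materialize the chunks into lists so they can be shuffled and reused.

```python
from itertools import compress
import random

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
daily = pd.DataFrame(
    {"temp": rng.random(365) * 120},
    index=pd.date_range("2019-01-01", "2019-12-31", freq="D"),
)

# Iterating the Resampler yields (week_end_label, sub-DataFrame) pairs
pairs = list(daily.resample("W"))
mask = rng.random(len(pairs)) < 0.8

# compress keeps the pairs whose selector is truthy; drop the labels
train = [week for _, week in compress(pairs, mask)]
valid = [week for _, week in compress(pairs, ~mask)]

# Lists can be shuffled in place (seed fixed only for reproducibility)
random.Random(0).shuffle(train)
```

Note that a random mask only gives roughly an 80/20 split, not an exact one.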
Upvotes: 1
Reputation: 2059
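Your issue is that resample on its own does not return a DataFrame. It returns a lazy Resampler object, so indexing it with a mask or passing it to train_test_split fails. Call an aggregation method on it and your code works: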
...
weekly = daily.resample("W").mean() # <- Note the call to complete the resample with weekly mean
train, valid = train_test_split(weekly, train_size=.8)
train.shape
# (42, 1)
valid.shape
# (11, 1)
42 / (42 + 11)
# 0.7924528301886793
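To see the difference between the bare Resampler and the aggregated result, here is a small sketch, assuming random daily data like in the question (variable names are mine). Keep in mind that .mean() collapses each week into a single row, so this only fits if one value per week is acceptable.

```python
import numpy as np
import pandas as pd

daily = pd.DataFrame(
    {"temp": np.random.default_rng(0).random(365) * 120},
    index=pd.date_range("2019-01-01", "2019-12-31", freq="D"),
)

weekly = daily.resample("W")                 # lazy Resampler, nothing aggregated yet
is_frame = isinstance(weekly, pd.DataFrame)  # False: it is not a DataFrame

# Iterating it gives (week_end_label, sub-DataFrame) pairs
label, first_week = next(iter(weekly))

# Calling an aggregation produces a regular DataFrame, one row per week
weekly_mean = weekly.mean()
```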
EDIT: If you don't want to resample, just loop through weeks with a groupby:
...
for date, week in daily.groupby(pd.Grouper(freq='W')):
    train, valid = train_test_split(week, train_size=.8)
    print(date)
    print(train.shape)
    print(valid.shape)
2019-01-06 00:00:00
(4, 1)
(2, 1)
2019-01-13 00:00:00
(5, 1)
(2, 1)
2019-01-20 00:00:00
(5, 1)
(2, 1)
2019-01-27 00:00:00
(5, 1)
(2, 1)
2019-02-03 00:00:00
(5, 1)
(2, 1)
...
EDIT: If you want to sample weeks as the unit of observation, you'll want to make a new column for them:
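iso = daily.index.isocalendar()  # DatetimeIndex.week was removed in pandas 2.0; the ISO year keeps Dec 30-31 out of week 1 of 2019
daily['week'] = iso.year.astype(str) + '-' + iso.week.astype(str)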
                  temp    week
2019-01-01   98.551345  2019-1
2019-01-02  103.880149  2019-1
2019-01-03   48.187819  2019-1
2019-01-04  116.942540  2019-1
2019-01-05   21.342152  2019-1
...                ...     ...
Then train/test split the weeks and select the rows:
train_weeks, test_weeks = train_test_split(daily.week.unique(), train_size=.8)
train = daily[daily.week.isin(train_weeks)]
test = daily[daily.week.isin(test_weeks)]
train.shape
#(288, 2)
test.shape
#(77, 2)
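If what you need in the end is a shuffled list of un-aggregated weekly chunks to feed the model one at a time, one possible variation is to split the chunk positions instead of a label column. This is only a sketch: the names weeks, train_chunks and valid_chunks and the random_state value are assumptions for illustration, not something from the question.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

daily = pd.DataFrame(
    {"temp": np.random.default_rng(0).random(365) * 120},
    index=pd.date_range("2019-01-01", "2019-12-31", freq="D"),
)

# One un-aggregated DataFrame per calendar week (weeks end on Sunday)
weeks = [week for _, week in daily.groupby(pd.Grouper(freq="W"))]

# Split (and shuffle) the week positions, so weeks are the unit of observation
train_idx, valid_idx = train_test_split(
    np.arange(len(weeks)), train_size=0.8, shuffle=True, random_state=0
)

train_chunks = [weeks[i] for i in train_idx]
valid_chunks = [weeks[i] for i in valid_idx]
```

Each element of train_chunks and valid_chunks is a full week of daily rows (the first and last weeks of the year are partial), and the order is already shuffled by train_test_split.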
Upvotes: 1