Pandas groupby training/validation split

I have a daily temperature dataset and I am trying to build a model that operates on one week of data at a time. I've imported it into a pandas DataFrame and grouped it by week (using the resample method). So far so good.

Please note that I do not want to aggregate the weekly data; I just want to group my "flat" dataset into weekly "chunks" that I can feed into the model one at a time.

I was able to accomplish this with the code below, but my question is:

How can I split this grouped DataFrame into training/validation sets?

Here is what I've tried so far (and mostly failed):

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

daily = pd.DataFrame(
    data=np.random.rand(365) * 120, columns=["temp"],
    index=pd.date_range(start="2019-01-01", end="2019-12-31", freq="d")
)
print("days:", len(daily))

weekly = daily.resample("W")
print("weeks:", len(weekly))

mask = np.random.rand(len(weekly)) < .8
# Both of these give KeyError: 'Columns not found: False, True'
train = weekly[mask]
valid = weekly[~mask]

# This also fails with KeyError: 'Columns not found: 12'
train, valid = train_test_split(weekly, train_size=.8)

UPDATE:

In the meantime, I came up with a pair of generators I can use for training/validation:

def gen_train(df, mask):
    for index, (_, data) in enumerate(df):
        if mask[index]: yield data

def gen_valid(df, mask):
    for index, (_, data) in enumerate(df):
        if not mask[index]: yield data

mask = np.random.rand(len(weekly)) < .8

model.fit(x=gen_train(weekly, mask), validation_data=gen_valid(weekly, mask),
    ...
)

Unfortunately, this doesn't shuffle the data.

Can anyone come up with a better solution?

Upvotes: 2

Views: 2080

Answers (2)

Ehsan

Reputation: 711

Use itertools.compress, which keeps only the items whose corresponding mask entry is true:

from itertools import compress

train = compress(weekly, mask)
valid = compress(weekly, ~mask)
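
As a rough sketch of how this could be combined with shuffling (the thing the generators in the question were missing): iterating over the resampler yields (label, frame) pairs, so the weekly frames can be collected into a list first, filtered with compress, and then shuffled. The seeds and the names `weeks`, `train` and `valid` are my own choices for illustration, not anything from the question.

```python
import random
from itertools import compress

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, only so the sketch is reproducible
daily = pd.DataFrame(
    {"temp": rng.random(365) * 120},
    index=pd.date_range(start="2019-01-01", end="2019-12-31", freq="D"),
)

# Iterating a resampler yields (label, frame) pairs; keep only the frames.
weeks = [week for _, week in daily.resample("W")]

# One boolean per week: True -> training, False -> validation.
mask = rng.random(len(weeks)) < 0.8

# compress() is lazy, so materialise the results as lists.
train = list(compress(weeks, mask))
valid = list(compress(weeks, ~mask))

# A list can be shuffled in place, unlike a generator.
random.Random(0).shuffle(train)
```

Each element of `train` and `valid` is still an unaggregated daily DataFrame covering one week, so it can be fed to a model one chunk at a time.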

Upvotes: 1

Dave

Reputation: 2059

Your issue is that you're not completing the resample method: on its own, daily.resample("W") returns a Resampler object rather than a DataFrame. Choose an aggregation to resample with and your code works:

...
weekly = daily.resample("W").mean() # <- Note the call to complete the resample with weekly mean
train, valid = train_test_split(weekly, train_size=.8)

train.shape
# (42, 1)

valid.shape
# (11, 1)

42 / (42 + 11)
# 0.7924528301886793

EDIT: If you don't want to aggregate, just loop through the weeks with a groupby. Note that this splits the days within each week into train and validation rows:

...
for date, week in daily.groupby(pd.Grouper(freq='W')):
    train, valid = train_test_split(week, train_size=.8)
    print(date)
    print(train.shape)
    print(valid.shape)

2019-01-06 00:00:00
(4, 1)
(2, 1)
2019-01-13 00:00:00
(5, 1)
(2, 1)
2019-01-20 00:00:00
(5, 1)
(2, 1)
2019-01-27 00:00:00
(5, 1)
(2, 1)
2019-02-03 00:00:00
(5, 1)
(2, 1)
...

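EDIT: If you want to sample weeks as the unit of observation, you'll want to make a new column for them. Be aware that this labels 2019-12-30 and 2019-12-31 as 2019-1 as well, because those days fall in ISO week 1 of the following year:

# Note: DatetimeIndex.week was deprecated in pandas 1.1 and removed in 2.0;
# on newer versions, daily.index.isocalendar().week gives the ISO week numbers.
daily['week'] = daily.index.year.astype(str) + '-' + daily.index.week.astype(str)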

                  temp     week
2019-01-01   98.551345   2019-1
2019-01-02  103.880149   2019-1
2019-01-03   48.187819   2019-1
2019-01-04  116.942540   2019-1
2019-01-05   21.342152   2019-1
...                ...      ...

Then train/test split the weeks and select the rows:

train_weeks, test_weeks = train_test_split(daily.week.unique(), train_size=.8)
train = daily[daily.week.isin(train_weeks)]
test = daily[daily.week.isin(test_weeks)]

train.shape
#(288, 2)

test.shape
#(77, 2)
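
For what it's worth, here is a sketch of the same idea that avoids string labels and the year-boundary overlap, assuming the goal is shuffled, unaggregated weekly chunks. It numbers the weeks with pd.factorize on the index converted to weekly periods, splits those numbers, and then pulls out the matching rows. The names `week_id`, `train_ids`, `valid_ids` and `train_chunks` are made up for this example, and the seeds are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)  # fixed seed, only so the sketch is reproducible
daily = pd.DataFrame(
    {"temp": rng.random(365) * 120},
    index=pd.date_range(start="2019-01-01", end="2019-12-31", freq="D"),
)

# Give every day an integer id for the Monday-to-Sunday week it falls in.
week_id, _ = pd.factorize(daily.index.to_period("W"))

# Split the distinct week ids; train_test_split shuffles by default.
train_ids, valid_ids = train_test_split(
    np.unique(week_id), train_size=0.8, random_state=0
)

# Select the daily rows belonging to each set of weeks (no aggregation).
in_train = np.isin(week_id, train_ids)
train = daily[in_train]
valid = daily[~in_train]

# One DataFrame per training week, in the shuffled order of train_ids.
train_chunks = [daily[week_id == i] for i in train_ids]
```

Because the split happens on week ids rather than on rows, no week ends up partly in training and partly in validation.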

Upvotes: 1
