Marco_CH

Reputation: 3294

Create hundreds of TimeSeries & Train-/Testsets with loop or function

MWE

I have a dataset with a little over 1 million rows, containing several hundred time series. Here is a simplified MWE of this data:

import pandas as pd

df = pd.DataFrame({"dtime":["2022-01-01", "2022-01-02", "2022-01-03", "2022-01-01", "2022-01-02", "2022-01-03",
                           "2022-01-01", "2022-01-02", "2022-01-03","2022-01-01", "2022-01-02", "2022-01-03"],
                   "Type":["A","A","A","B","B","B","C","C","C","D","D","D"],
                   "Value":[1,2,3,4,6,8,1,5,8,3,1,2]})

+----+------------+--------+---------+
|    | dtime      | Type   |   Value |
|----+------------+--------+---------|
|  0 | 2022-01-01 | A      |       1 |
|  1 | 2022-01-02 | A      |       2 |
|  2 | 2022-01-03 | A      |       3 |
|  3 | 2022-01-01 | B      |       4 |
|  4 | 2022-01-02 | B      |       6 |
|  5 | 2022-01-03 | B      |       8 |
|  6 | 2022-01-01 | C      |       1 |
|  7 | 2022-01-02 | C      |       5 |
|  8 | 2022-01-03 | C      |       8 |
|  9 | 2022-01-01 | D      |       3 |
| 10 | 2022-01-02 | D      |       1 |
| 11 | 2022-01-03 | D      |       2 |
+----+------------+--------+---------+

Type identifies the time series group, so A is one time series, B is another, and so on.


Goal

I want to train a multi-dimensional time series neural network (like those provided by Unit8's darts package).

from darts.models import NBEATSModel
model = NBEATSModel(input_chunk_length=50, output_chunk_length=50, n_epochs=25)
model.fit([train1, train2, train3, train4])

For this, I need the data separated by Type, converted into TimeSeries objects, and finally split into train/test sets, like this:

from darts import TimeSeries

split_date = "2022-01-02"

series1 = TimeSeries.from_dataframe(df[df["Type"] == "A"], "dtime", "Value", freq="D", fillna_value=0)
series2 = TimeSeries.from_dataframe(df[df["Type"] == "B"], "dtime", "Value", freq="D", fillna_value=0)
series3 = TimeSeries.from_dataframe(df[df["Type"] == "C"], "dtime", "Value", freq="D", fillna_value=0)
series4 = TimeSeries.from_dataframe(df[df["Type"] == "D"], "dtime", "Value", freq="D", fillna_value=0)
train1, val1 = series1.split_before(pd.Timestamp(split_date))
train2, val2 = series2.split_before(pd.Timestamp(split_date))
train3, val3 = series3.split_before(pd.Timestamp(split_date))
train4, val4 = series4.split_before(pd.Timestamp(split_date))

But since the real-world data has far more than 4 Type values, doing this manually would be impractical, so I'm looking for a solution with a loop or a function that creates these series, train and test TimeSeries objects.

In addition to the sequentially numbered series, train and test TimeSeries objects, I want to create a list containing each trainX object, like:

ts_list = [train1, train2, train3, train4]

Does anybody have an idea how I can do this? I'm happy with any proposal.

Upvotes: 0

Views: 643

Answers (2)

Léo Beaucourt

Reputation: 267

Have you tried a groupby over the Type column in a loop? For example:

train_list = []
for ts_type, group in df.groupby('Type'):
    series = TimeSeries.from_dataframe(group, "dtime", "Value", freq="D", fillna_value=0)
    train, val = series.split_before(pd.Timestamp(split_date))
    train_list.append(train)

However, with a lot of data, looping over groups in pandas could become computationally expensive. So maybe a better solution can be found (using other tools such as Spark, for example).
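If you also want to keep the validation parts and be able to look up each piece by its Type, dictionaries keyed by the group label may be easier to handle than numbered variables such as train1, train2. The following is only a minimal sketch of the split step using plain pandas; split_groups is a hypothetical helper (not part of darts), and it assumes the same rule as split_before, i.e. the split date itself goes into the validation part. Each resulting pandas Series could then be converted to a darts TimeSeries, for example with TimeSeries.from_series, or the loop body above could be used instead.

```python
import pandas as pd


def split_groups(df, group_col, time_col, value_col, split_date):
    """Split each group's values into (train, val) pandas Series by date.

    Hypothetical helper, not part of darts. Rows strictly before
    split_date go to train; rows on or after it go to val.
    """
    split_ts = pd.Timestamp(split_date)
    train, val = {}, {}
    for label, group in df.groupby(group_col):
        # Index the group's values by parsed timestamps, in time order.
        s = group.set_index(pd.to_datetime(group[time_col]))[value_col].sort_index()
        train[label] = s[s.index < split_ts]
        val[label] = s[s.index >= split_ts]
    return train, val


# Same toy data as in the question.
df = pd.DataFrame({
    "dtime": ["2022-01-01", "2022-01-02", "2022-01-03"] * 4,
    "Type": ["A"] * 3 + ["B"] * 3 + ["C"] * 3 + ["D"] * 3,
    "Value": [1, 2, 3, 4, 6, 8, 1, 5, 8, 3, 1, 2],
})

train, val = split_groups(df, "Type", "dtime", "Value", "2022-01-02")

# One training piece per Type, in sorted Type order (A, B, C, D).
ts_list = list(train.values())
```

With this layout, train["B"] and val["B"] always belong together, and ts_list plays the role of [train1, train2, train3, train4] from the question.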

Upvotes: 1

Julien Herzen

Reputation: 171

The answer given above looks good. I would add that before implementing this, you should probably ask yourself whether:

  • You want to model your data using one time series per group. In this case, the proposed option of looping over groups looks good. You should probably use this representation if the groups represent distinct "observations" of the same underlying phenomenon (e.g., heart rate series of two distinct patients).
  • You want to model your data using one time series for all groups, where each group makes up one dimension of this (multivariate) time series. You should use this when each group represents a distinct dimension of a single observation (e.g., heart rate and blood pressure of a single patient). In this latter case, you should transform the dataframe to have the groups in separate columns, and call TimeSeries.from_dataframe() only once.
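For the second option, the reshaping step could look roughly like the sketch below, which uses the toy data from the question and plain pandas. The variable name wide is just illustrative, and the final darts call is only shown as a comment because it is an assumption about how you would pass the result on.

```python
import pandas as pd

# Same toy data as in the question.
df = pd.DataFrame({
    "dtime": ["2022-01-01", "2022-01-02", "2022-01-03"] * 4,
    "Type": ["A"] * 3 + ["B"] * 3 + ["C"] * 3 + ["D"] * 3,
    "Value": [1, 2, 3, 4, 6, 8, 1, 5, 8, 3, 1, 2],
})

# Long -> wide: one row per timestamp, one column per Type.
wide = (
    df.assign(dtime=pd.to_datetime(df["dtime"]))
      .pivot(index="dtime", columns="Type", values="Value")
)
wide.columns.name = None  # drop the leftover "Type" label on the column axis

print(wide)

# Assumed follow-up (not run here): build a single multivariate series, e.g.
# TimeSeries.from_dataframe(wide.reset_index(), "dtime", list(wide.columns), freq="D")
```

Note that pivot raises an error if a (dtime, Type) pair occurs more than once, so duplicates in the real data would need to be aggregated first.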

Upvotes: 0
