Reputation: 3294
I have a dataset with a bit more than 1 million rows, containing several hundred time series. Here is a simplified MWE of this data:
import pandas as pd
df = pd.DataFrame({"dtime":["2022-01-01", "2022-01-02", "2022-01-03", "2022-01-01", "2022-01-02", "2022-01-03",
"2022-01-01", "2022-01-02", "2022-01-03","2022-01-01", "2022-01-02", "2022-01-03"],
"Type":["A","A","A","B","B","B","C","C","C","D","D","D"],
"Value":[1,2,3,4,6,8,1,5,8,3,1,2]})
+----+------------+--------+---------+
|    | dtime      | Type   |   Value |
|----+------------+--------+---------|
|  0 | 2022-01-01 | A      |       1 |
|  1 | 2022-01-02 | A      |       2 |
|  2 | 2022-01-03 | A      |       3 |
|  3 | 2022-01-01 | B      |       4 |
|  4 | 2022-01-02 | B      |       6 |
|  5 | 2022-01-03 | B      |       8 |
|  6 | 2022-01-01 | C      |       1 |
|  7 | 2022-01-02 | C      |       5 |
|  8 | 2022-01-03 | C      |       8 |
|  9 | 2022-01-01 | D      |       3 |
| 10 | 2022-01-02 | D      |       1 |
| 11 | 2022-01-03 | D      |       2 |
+----+------------+--------+---------+
Type represents the time-series group, so A is one time series, B another, and so on.
I want to train a multi-dimensional time-series NN (like those provided by the unit8 darts package).
from darts.models import NBEATSModel
model = NBEATSModel(input_chunk_length=50, output_chunk_length=50, n_epochs=25)
model.fit([train1, train2, train3, train4])
For this I need each Type separated, converted into TimeSeries format, and finally split into train/test, like this:
from darts import TimeSeries
split_date = "2022-01-02"
series1 = TimeSeries.from_dataframe(df[df["Type"] == "A"], "dtime", "Value", freq="D", fillna_value=0)
series2 = TimeSeries.from_dataframe(df[df["Type"] == "B"], "dtime", "Value", freq="D", fillna_value=0)
series3 = TimeSeries.from_dataframe(df[df["Type"] == "C"], "dtime", "Value", freq="D", fillna_value=0)
series4 = TimeSeries.from_dataframe(df[df["Type"] == "D"], "dtime", "Value", freq="D", fillna_value=0)
train1, val1 = series1.split_before(pd.Timestamp(split_date))
train2, val2 = series2.split_before(pd.Timestamp(split_date))
train3, val3 = series3.split_before(pd.Timestamp(split_date))
train4, val4 = series4.split_before(pd.Timestamp(split_date))
But since the real-world data has far more than 4 Type values, doing this manually would be overkill, so I'm looking for a solution with a loop or a function that creates the train, test and series TS objects.
In addition to the individual series, train and test TS objects, I want to create a list containing every trainX object, like:
ts_list = [train1, train2, train3, train4]
Does anybody have an idea how I can do this? I'm happy for any proposal.
Upvotes: 0
Views: 643
Reputation: 267
Did you try using a groupby over the Type column in a loop?
train_list = []
# One iteration per Type; group holds only that type's rows
for ts_type, group in df.groupby('Type'):
    series = TimeSeries.from_dataframe(group, "dtime", "Value", freq="D", fillna_value=0)
    train, val = series.split_before(pd.Timestamp(split_date))
    train_list.append(train)
However, with a lot of data, looping over groups in pandas could become computationally expensive, so a better solution may exist (for example, using other tools such as Spark).
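If you also want to keep the per-type train and validation parts around (not only the list of training series), one option is to collect them in dicts keyed by Type. The following is only a pandas-level sketch of that idea: the helper name split_by_type and its parameters are made up for illustration, and it splits the plain DataFrames rather than darts objects. Rows strictly before the split date go to train and the rest to validation, which is meant to mirror how split_before is used in the question. With darts installed, each resulting frame could then be passed to TimeSeries.from_dataframe as in the question.

```python
import pandas as pd

def split_by_type(df, split_date, group_col="Type", time_col="dtime"):
    """Hypothetical helper: return (train, val) dicts of DataFrames keyed by group label."""
    df = df.copy()
    df[time_col] = pd.to_datetime(df[time_col])  # parse the dates once for the whole frame
    cutoff = pd.Timestamp(split_date)
    train_frames, val_frames = {}, {}
    for label, group in df.groupby(group_col):
        group = group.sort_values(time_col)
        # rows before the cutoff are training data, the cutoff itself starts validation
        train_frames[label] = group[group[time_col] < cutoff]
        val_frames[label] = group[group[time_col] >= cutoff]
    return train_frames, val_frames

# Same data as the MWE in the question
df = pd.DataFrame({
    "dtime": ["2022-01-01", "2022-01-02", "2022-01-03"] * 4,
    "Type": ["A"] * 3 + ["B"] * 3 + ["C"] * 3 + ["D"] * 3,
    "Value": [1, 2, 3, 4, 6, 8, 1, 5, 8, 3, 1, 2],
})

train_frames, val_frames = split_by_type(df, "2022-01-02")
train_list = list(train_frames.values())  # one training frame per Type, in label order
```

Keeping dicts instead of numbered variables means you can look up a single type with train_frames["A"] and still get the full list for model.fit from the dict values after conversion.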
Upvotes: 1
Reputation: 171
The answer given above looks good. I would add that before implementing this, you should probably ask yourself whether:
TimeSeries.from_dataframe()
only once.
Upvotes: 0