LSTM forecasting with single categorical feature

I'm pretty new to time series.
This is the dataset I'm working on:

           Date   Price               Location
0    2012-01-01  1771.0                 Marche
1    2012-01-01  1039.0               Calabria
2    2012-01-01  2193.0               Campania
3    2012-01-01  2015.0         Emilia-Romagna
4    2012-01-01  1483.0  Friuli-Venezia Giulia
...         ...     ...                    ...
2475 2022-04-01  1963.0                  Lazio
2476 2022-04-01  1362.0  Friuli-Venezia Giulia
2477 2022-04-01  1674.0         Emilia-Romagna
2478 2022-04-01  1388.0                 Marche
2479 2022-04-01  1103.0                Abruzzo

I'm trying to build an LSTM for price prediction, but I don't know how to handle the Location categorical feature: should I use one-hot encoding or a groupby? What I want to predict is the price based on the location.
How can I achieve that? A Python solution is particularly appreciated.

Thanks in advance.

Upvotes: 3

Views: 399

Answers (1)

Salvatore Daniele Bianco

Reputation: 2701

Suppose my dataset (df) is analogous to yours:

          Date       Price  Location
0   2021-01-01  791.076890  Campania
1   2021-01-01  705.702464  Lombardia
2   2021-01-01  719.991382  Sicilia
3   2021-02-01  825.760917  Lombardia
4   2021-02-01  747.734309  Sicilia
...        ...         ...        ...
31  2021-11-01  886.874348  Lombardia
32  2021-11-01  935.040583  Campania
33  2021-12-01  771.165378  Sicilia
34  2021-12-01  952.255227  Campania
35  2021-12-01  939.754515  Lombardia

In my case I have a Price record for 3 regions (Campania, Lombardia, Sicilia) every month. My idea is to treat the different regions as different features, so I would transform df like this:

# pivot: one column of prices per region, indexed by date
df = df.set_index(["Date", "Location"])["Price"].unstack()

Now my dataset is like:

Location    Campania    Lombardia   Sicilia
Date            
2021-01-01  791.076890  705.702464  719.991382
2021-02-01  758.872755  825.760917  747.734309
2021-03-01  880.038005  803.165998  837.738419
       ...         ...         ...         ...
2021-10-01  908.402345  805.081193  792.369610
2021-11-01  935.040583  886.874348  736.862025
2021-12-01  952.255227  939.754515  771.165378

After this, please make sure there are no NaN values (check with df.isna().sum()).
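For example, a quick check plus one possible way to fill small gaps (this assumes linear interpolation over time is acceptable for your prices):

# count missing values per region column
print(df.isna().sum())

# assumption: linear interpolation over time is fine for these prices
df = df.interpolate(method="linear")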

Now you can pass this data to a multi-feature RNN (or LSTM), as in this example, or to a multi-channel 1D-CNN (choosing an appropriate kernel size); a small example model is sketched after the split below. The only problem in both cases could be the small size of the dataset, so try not to over-parameterize the model (for example, by reducing the number of neurons and layers); otherwise over-fitting will be unavoidable. To keep an eye on this, you can test the model on the last 20% of your time series:

from sklearn.model_selection import train_test_split

# shuffle=False keeps the chronological order; the last 20% becomes the test set
df_train, df_test = train_test_split(df, shuffle=False, test_size=.2)
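If you go for the LSTM, a minimal sketch in Keras could look like this (the window length, layer sizes and the use of tf.keras are my assumptions; tune them to your data):

import tensorflow as tf

N_REGIONS = df.shape[1]  # 3 regions in this toy example
WINDOW = 6               # assumed look-back of 6 months

# deliberately small network to limit over-fitting on a short series
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(WINDOW, N_REGIONS)),
    tf.keras.layers.LSTM(16),
    tf.keras.layers.Dense(N_REGIONS),  # next-month price for every region
])
model.compile(optimizer="adam", loss="mse")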

The last part is to build matching (X, Y) pairs for supervised learning, but this depends on which model you are using and what your prediction task is. Another example here.
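As a rough sketch for a one-step-ahead task (each sample uses the previous WINDOW months of all regions to predict the next month; this assumes each split is longer than the window, which is true for your ~10 years of monthly data but not for my 12-row toy example):

import numpy as np

def make_windows(values, window):
    """X = `window` consecutive months (all regions), Y = the following month."""
    X, Y = [], []
    for i in range(len(values) - window):
        X.append(values[i:i + window])
        Y.append(values[i + window])
    return np.array(X), np.array(Y)

X_train, Y_train = make_windows(df_train.values, WINDOW)
X_test, Y_test = make_windows(df_test.values, WINDOW)

model.fit(X_train, Y_train, epochs=200, batch_size=4,
          validation_data=(X_test, Y_test))

model.predict(X_test) then returns one predicted price per region for each test window.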

Upvotes: 1
