LSTM forecasting with single categorical feature

I'm pretty new to time series.
This is the dataset I'm working on:

           Date   Price               Location
0    2012-01-01  1771.0                 Marche
1    2012-01-01  1039.0               Calabria
2    2012-01-01  2193.0               Campania
3    2012-01-01  2015.0         Emilia-Romagna
4    2012-01-01  1483.0  Friuli-Venezia Giulia
...         ...     ...                    ...
2475 2022-04-01  1963.0                  Lazio
2476 2022-04-01  1362.0  Friuli-Venezia Giulia
2477 2022-04-01  1674.0         Emilia-Romagna
2478 2022-04-01  1388.0                 Marche
2479 2022-04-01  1103.0                Abruzzo

I'm trying to build an LSTM for price prediction, but I don't know how to handle the Location categorical feature: should I use one-hot encoding or a groupby? What I want to predict is the price based on the location.
How can I achieve that? A Python solution is particularly appreciated.

Thanks in advance.

Upvotes: 3

Views: 399

Answers (1)

Salvatore Daniele Bianco

Reputation: 2701

Suppose my dataset (df) is analogous to yours:

          Date       Price  Location
0   2021-01-01  791.076890  Campania
1   2021-01-01  705.702464  Lombardia
2   2021-01-01  719.991382  Sicilia
3   2021-02-01  825.760917  Lombardia
4   2021-02-01  747.734309  Sicilia
...        ...         ...        ...
31  2021-11-01  886.874348  Lombardia
32  2021-11-01  935.040583  Campania
33  2021-12-01  771.165378  Sicilia
34  2021-12-01  952.255227  Campania
35  2021-12-01  939.754515  Lombardia

In my case I have a Price record for 3 regions (Campania, Lombardia, Sicilia) every month. My idea is to treat the different regions as different features, so I would transform df like this:

# pivot: one column of prices per region, indexed by date
df = df.set_index(["Date", "Location"])["Price"].unstack()

Now my dataset is like:

Location    Campania    Lombardia   Sicilia
Date            
2021-01-01  791.076890  705.702464  719.991382
2021-02-01  758.872755  825.760917  747.734309
2021-03-01  880.038005  803.165998  837.738419
       ...         ...         ...         ...
2021-10-01  908.402345  805.081193  792.369610
2021-11-01  935.040583  886.874348  736.862025
2021-12-01  952.255227  939.754515  771.165378

After this, please make sure there are no NaN values (check with df.isna().sum()).
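For example, a quick check plus one possible way to fill small gaps (this assumes linear interpolation over time is acceptable for your prices):

# count missing values per region column
print(df.isna().sum())

# assumption: linear interpolation over time is fine for these prices
df = df.interpolate(method="linear")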

Now you can pass this data to a multi-feature RNN (or LSTM), as in this example, or to a multi-channel 1D-CNN (choosing an appropriate kernel size); a small example model is sketched after the split below. The only problem in both cases could be the small size of the dataset, so try not to over-parameterize the model (for example, by reducing the number of neurons and layers); otherwise over-fitting will be unavoidable. To keep an eye on this, you can test the model on the last 20% of your time series:

from sklearn.model_selection import train_test_split

# shuffle=False keeps the chronological order; the last 20% becomes the test set
df_train, df_test = train_test_split(df, shuffle=False, test_size=.2)
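If you go for the LSTM, a minimal sketch in Keras could look like this (the window length, layer sizes and the use of tf.keras are my assumptions; tune them to your data):

import tensorflow as tf

N_REGIONS = df.shape[1]  # 3 regions in this toy example
WINDOW = 6               # assumed look-back of 6 months

# deliberately small network to limit over-fitting on a short series
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(WINDOW, N_REGIONS)),
    tf.keras.layers.LSTM(16),
    tf.keras.layers.Dense(N_REGIONS),  # next-month price for every region
])
model.compile(optimizer="adam", loss="mse")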

The last part is to build matching (X, Y) pairs for supervised learning, but this depends on which model you are using and what your prediction task is. Another example here.
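As a rough sketch for a one-step-ahead task (each sample uses the previous WINDOW months of all regions to predict the next month; this assumes each split is longer than the window, which is true for your ~10 years of monthly data but not for my 12-row toy example):

import numpy as np

def make_windows(values, window):
    """X = `window` consecutive months (all regions), Y = the following month."""
    X, Y = [], []
    for i in range(len(values) - window):
        X.append(values[i:i + window])
        Y.append(values[i + window])
    return np.array(X), np.array(Y)

X_train, Y_train = make_windows(df_train.values, WINDOW)
X_test, Y_test = make_windows(df_test.values, WINDOW)

model.fit(X_train, Y_train, epochs=200, batch_size=4,
          validation_data=(X_test, Y_test))

model.predict(X_test) then returns one predicted price per region for each test window.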

Upvotes: 1
