How to use LSTM for sequence classification using KerasClassifier

Question

I have a binary classification problem where I need to predict the potential future trendy/popular products based on customer interactions during 2010-2015.

Currently, my dataset includes 1000 products and each product is labelled as 0 or 1 (i.e. binary classification). The label was decided based on customer interactions during 2016-2018.

I am calculating how centrality measures changed over time for each product during 2010-2015 as the features for my binary classification problem. For example, consider the below figure that shows how degree centrality changed over time for each product.

More specifically, I analyse the change of following centrality measures as the features for my binary classification problem.

how degree centrality of each good changed from 2010-2015 (see the above figure)
how betweenness centrality of each good changed from 2010-2015
how closeness centrality of each good changed from 2010-2015
how eigenvector centrality of each good changed from 2010-2015

In a nutshell, my data looks as follows.

product, change_of_degree_centrality, change_of_betweenness_centrality, change_of_closeness_centrality, change_of_eigenvector_centrality, Label
item_1, [1.2, 2.5, 3.7, 4.2, 5.6, 8.8], [8.8, 4.6, 3.2, 9.2, 7.8, 8.6], …, 1
item_2, [5.2, 4.5, 3.7, 2.2, 1.6, 0.8], [1.5, 0, 1.2, 1.9, 2.5, 1.2], …, 0
and so on.

I wanted to use a deep learning model to solve my problem. While reading tutorials, I realised that an LSTM suits it.

So I am using the following model for my classification.

model = Sequential()
model.add(LSTM(10, input_shape=(6,4))) #where 6 is length of centrality sequence and 4 is types of centrality (i.e. degree centrality, betweenness centrality, closeness centrality, and eigenvector centrality)
model.add(Dense(32))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

Since I have a small dataset, I wanted to perform 10-fold cross-validation. So I am using KerasClassifier as follows, based on this tutorial.

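from keras.models import Sequential
from keras.layers import LSTM, Dense
# Note: this wrapper was removed from recent Keras releases; the scikeras package provides a replacement
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score

print(features.shape) #(1000, 6, 4)
print(target.shape) #(1000,)

# Create function returning a compiled network
def create_network():
    model = Sequential()
    # 6 time steps per product, 4 centrality measures per time step
    model.add(LSTM(10, input_shape=(6,4)))
    model.add(Dense(32))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    return model

# Wrap Keras model so it can be used by scikit-learn
neural_network = KerasClassifier(build_fn=create_network,
                                 epochs=10,
                                 batch_size=100,
                                 verbose=0)

# Each fold holds out a subset of products, not a range of years
print(cross_val_score(neural_network, features, target, cv=5))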

However, I read that it is wrong to use cross-validation with LSTM (e.g., this tutorial, this question).

I am not clear whether this applies to me, as I am only doing a binary classification prediction to identify products that will be trendy/popular in the future (not forecasting).

I think the data in my problem setting is split point-wise in the cross-validation, not time-wise.

i.e. (point-wise)

1st fold training:
item_1, item_2, ........, item_799, item_800

1st fold testing:
item_801, ........, item_1000

not (time-wise)

1st fold training:
2010, 2011, ........, 2015

1st fold testing:
2016, ........, 2018

Given this, I am assuming that using cross-validation is valid for my problem.

Please let me know a suitable way to use cross-validation according to my problem and dataset.

NOTE: I am not limited to LSTM and happy to explore other models as well.

I am happy to provide more details if needed.

Kyle · Accepted Answer

There are many types of cross-validation, just as there are many types of neural networks. In your case, you are trying to use k-fold cross-validation.

The question you linked correctly states that k-fold cross-validation should not be used with time series data. You can’t accurately evaluate your model if you train on some data and then test on data that occurred before the training data.

However, other forms of cross-validation (such as the mentioned sliding window or expanding window) will still work with time series data. sklearn provides TimeSeriesSplit, which splits the data using the expanding window method: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html
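As a rough illustration of the expanding window idea, here is a minimal sketch using TimeSeriesSplit. The array X is placeholder data (12 observations assumed to be ordered in time), not your dataset, and n_splits=3 is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Placeholder data: 12 observations assumed to be ordered in time.
X = np.arange(12).reshape(-1, 1)

splitter = TimeSeriesSplit(n_splits=3)
folds = list(splitter.split(X))

for train_idx, test_idx in folds:
    # The training window grows with each fold, and every training index
    # comes before every test index, so no future data leaks into training.
    print("train:", train_idx, "test:", test_idx)
```

This only makes sense when the rows themselves are ordered in time, which is the case the linked question is about.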

With all that said, I am not sure you are really using time series data in the sense that matters here. If you simply have the centrality scores for each year as separate features, then the order of your data points does not matter, since each item is only one data point (assuming that the scores of one item don’t impact another). In that case you can use k-fold cross-validation and other networks that work with iid data. You could even use non-neural models such as SVMs or decision trees.
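To make that last suggestion concrete, here is a minimal sketch, assuming the features array has the shape (1000, 6, 4) given in the question. The random data below is a placeholder standing in for the real centrality sequences and labels, and the variable names X_flat, cv, and results are mine. Each product's 6 x 4 sequence is flattened into 24 plain features, and shuffled 10-fold cross-validation is run with an SVM and a decision tree.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Placeholder data with the shapes from the question (not real centrality scores).
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 6, 4))
target = rng.integers(0, 2, size=1000)

# Treat each product as one iid data point: 6 years x 4 measures -> 24 features.
X_flat = features.reshape(len(features), -1)

# Folds are split by product, so shuffling the rows is fine here.
cv = KFold(n_splits=10, shuffle=True, random_state=0)

results = {}
for name, clf in [("SVM", SVC()),
                  ("decision tree", DecisionTreeClassifier(random_state=0))]:
    scores = cross_val_score(clf, X_flat, target, cv=cv)
    results[name] = scores
    print(name, "mean accuracy:", scores.mean())
```

With random placeholder labels the accuracies carry no meaning; the point is only the shape handling and the fold setup. If the classes are imbalanced, StratifiedKFold may be a better choice than plain KFold.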
