Reputation: 4608
I have a binary classification problem where I need to predict potential future trendy/popular products based on customer interactions during 2010-2015.
Currently, my dataset includes 1000 products, and each product is labelled as 0 or 1 (i.e. binary classification). The label was decided based on customer interactions during 2016-2018.
I am calculating how centrality measures changed over time for each product during 2010-2015 as the features for my binary classification problem. For example, consider the below figure that shows how degree centrality changed over time for each product.
More specifically, I analyse the change of the following centrality measures as the features for my binary classification problem:
- how the degree centrality of each good changed from 2010-2016 (see the above figure)
- how the betweenness centrality of each good changed from 2010-2016
- how the closeness centrality of each good changed from 2010-2016
- how the eigenvector centrality of each good changed from 2010-2016
In a nutshell, my data looks as follows.
product, change_of_degree_centrality, change_of_betweenness_centrality, change_of_closeness_centrality, change_of_eigenvector_centrality, Label
item_1, [1.2, 2.5, 3.7, 4.2, 5.6, 8.8], [8.8, 4.6, 3.2, 9.2, 7.8, 8.6], …, 1
item_2, [5.2, 4.5, 3.7, 2.2, 1.6, 0.8], [1.5, 0, 1.2, 1.9, 2.5, 1.2], …, 0
and so on.
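A minimal sketch of how one such row could be rearranged into the (6, 4) layout that an LSTM input expects: six yearly time steps, each holding the four centrality values. The closeness and eigenvector numbers here are made up for illustration, since the rows above elide them, and `to_timesteps` and `MEASURES` are hypothetical names.

```python
# Hypothetical values: the closeness and eigenvector sequences are made up,
# because the rows shown above elide them.
item_1 = {
    "degree":      [1.2, 2.5, 3.7, 4.2, 5.6, 8.8],
    "betweenness": [8.8, 4.6, 3.2, 9.2, 7.8, 8.6],
    "closeness":   [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],  # made up
    "eigenvector": [0.9, 0.8, 0.7, 0.6, 0.5, 0.4],  # made up
}

MEASURES = ["degree", "betweenness", "closeness", "eigenvector"]


def to_timesteps(item):
    """Turn four per-measure sequences into a list of per-year rows."""
    sequences = [item[name] for name in MEASURES]
    # zip(*sequences) groups the values of all four measures for the same year.
    return [list(step) for step in zip(*sequences)]


sample = to_timesteps(item_1)
print(len(sample), len(sample[0]))  # 6 4
print(sample[0])  # [1.2, 8.8, 0.1, 0.9]
```

Stacking 1000 such samples would give the (1000, 6, 4) shape mentioned for the features.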
I wanted to use a deep learning model to solve my issue. When reading tutorials, I realised that an LSTM suits my problem.
So, I am using the model below for my classification.
from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
# input_shape=(6, 4): 6 is the length of each centrality sequence, 4 is the number of
# centrality types (degree, betweenness, closeness, and eigenvector centrality)
model.add(LSTM(10, input_shape=(6, 4)))
model.add(Dense(32))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
Since I have a small dataset, I wanted to perform 10-fold cross-validation. So, I am using KerasClassifier as follows, following this tutorial.
from keras.wrappers.scikit_learn import KerasClassifier  # scikit-learn wrapper shipped with older Keras releases
from sklearn.model_selection import cross_val_score

print(features.shape)  # (1000, 6, 4)
print(target.shape)  # (1000,)

# Create function returning a compiled network
def create_network():
    model = Sequential()
    model.add(LSTM(10, input_shape=(6, 4)))
    model.add(Dense(32))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# Wrap Keras model so it can be used by scikit-learn
neural_network = KerasClassifier(build_fn=create_network,
                                 epochs=10,
                                 batch_size=100,
                                 verbose=0)

# cv=5 runs 5 folds; cv=10 would give 10-fold cross-validation
print(cross_val_score(neural_network, features, target, cv=5))
However, I noted that it is wrong to use cross-validation with LSTM (e.g., this tutorial, this question).
However, I am not clear whether this applies to me, as I am only doing a binary classification prediction to identify products that would be trendy/popular in future (not forecasting).
I think the data in my problem setting is divided point-wise in the cross-validation, not time-wise.
i.e. (point-wise)
1st fold training:
item_1, item_2, ........, item_799, item_800
1st fold testing:
item_801, ........, item_1000
not (time-wise)
1st fold training:
2010, 2011, ........, 2015
1st fold testing:
2016, ........, 2018
Because of this, I am assuming that using cross-validation is correct for my problem.
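The point-wise split described above can be sketched in plain Python as follows. This is only an illustration under the assumption that each product is one independent sample; `kfold_item_splits` is a hypothetical helper, and in practice scikit-learn's `KFold` or `StratifiedKFold` would do this job.

```python
import random


def kfold_item_splits(n_items, n_splits=10, seed=0):
    """Yield (train_idx, test_idx) pairs that partition items, not years."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Fold i takes every n_splits-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


splits = list(kfold_item_splits(1000, n_splits=10))
for train_idx, test_idx in splits[:2]:
    # Every product keeps its full 2010-2015 sequence; only the set of products differs.
    print(len(train_idx), len(test_idx))
```

Each product lands in exactly one test fold, and no fold ever separates the years of a single product.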
Please let me know a suitable way to use cross-validation according to my problem and dataset.
NOTE: I am not limited to LSTM and happy to explore other models as well.
I am happy to provide more details if needed.
Upvotes: 0
Views: 1191
Reputation: 186
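Maybe you have misunderstood the concept: KerasClassifier is suitable for LSTM.
Based on the links you gave, they only say that cross-validation is not suitable for time series that grow by rows, i.e. where each row is a later time step.
In your LSTM input, the time series grows by columns: each row is one product, and its time steps lie along the columns.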
Upvotes: 1
Reputation: 467
There are many types of cross validation, just as there are many types of neural networks. In your case you are trying to use k-fold cross validation.
The question you linked correctly states that k-fold cross validation should not be used with time series data. You can’t accurately evaluate your model if you are training on data and then testing on data that occurred before the training data.
However, other forms of cross validation (such as the mentioned sliding window or expanding window) will still work with your time series data. There is a function in sklearn that splits the data using the expanding window method. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html
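As a rough illustration of the expanding window idea, here is a plain-Python sketch. It is similar in spirit to `TimeSeriesSplit` but is not its implementation; `expanding_window_splits` is a hypothetical helper, and the list of years is only an assumed example of time-ordered samples.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_idx, test_idx) where training data always precedes test data."""
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("too many splits for the number of samples")
    for k in range(1, n_splits + 1):
        # The training window grows with each split; the test window follows it.
        train_end = n_samples - (n_splits - k + 1) * test_size
        yield list(range(train_end)), list(range(train_end, train_end + test_size))


years = list(range(2010, 2019))  # assumed example: one sample per year, 2010-2018
window_splits = list(expanding_window_splits(len(years), n_splits=3))
for train_idx, test_idx in window_splits:
    print([years[i] for i in train_idx], "->", [years[i] for i in test_idx])
```

Unlike k-fold, no split here ever trains on samples that come after the ones it is tested on.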
With all that said, I am not sure if you are really using time series data. If you simply have the centrality scores for each year as a separate feature, then the order of your data does not matter since each item is only one data point (assuming that the scores of one item don’t impact another). In that case you can use k-fold cross validation and other networks that work with i.i.d. data. You could even use non-neural models such as SVMs or decision trees.
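To make the "each year as a separate feature" view concrete, here is a small sketch that flattens one (6, 4) sample into a single row of 24 features, which is the form an SVM or decision tree would take. The numbers are generated placeholders, and `flatten_sample` is a hypothetical helper.

```python
def flatten_sample(sample):
    """Flatten a (timesteps x measures) nested list into one flat feature row."""
    return [value for step in sample for value in step]


# Placeholder data: 6 yearly time steps with 4 centrality values each.
sample_2d = [[year + 0.1 * measure for measure in range(4)] for year in range(6)]

flat_row = flatten_sample(sample_2d)
print(len(flat_row))  # 24
```

Each of the 1000 products would then be one row of 24 columns, and any standard k-fold split over rows applies.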
Upvotes: 1