Reputation: 1731
I am following a course on deep learning and I have a model built with Keras. After data preprocessing and encoding of the categorical data, I get an array of shape (12500,) as the input to the model. This input makes the model training process slow and laggy. Is there an approach to reduce the dimensionality of the inputs?
The inputs are categorised geo coordinates, weather info, time, and distance, and I am trying to predict the travel time between two geo coordinates.
The original dataset has 8 features, 5 of them categorical. I used one-hot encoding for the categorical data: the geo coordinates have 6000 categories, weather has 15 categories, and time has 96 categories. Altogether, after one-hot encoding, I got an array of shape (12500,) as the input to the model.
Upvotes: 3
Views: 1673
Reputation: 53758
When the number of categories is large, one-hot encoding becomes too inefficient. The extreme example of this is processing sentences in natural language: there the vocabulary often has 100k or more words. Obviously, translating a 10-word sentence into a [10, 100000] matrix, almost all of which is zero, would be a waste of memory.
What researchers use instead is an embedding layer, which learns a dense representation of a categorical feature. In the case of words, it's called a word embedding, e.g. word2vec. This representation is much smaller, something like 100-dimensional, and allows the rest of the network to work efficiently with 100-d input vectors rather than 100000-d vectors.
In Keras, it's implemented by an Embedding layer, which I think would work perfectly for your geo and time features, while the others may work fine with one-hot encoding. This means that your model is no longer Sequential, but instead has several inputs, some of which go through an embedding layer. The main model takes the concatenation of the learned representations and does the regression inference.
Upvotes: 4
Reputation: 516
You can use PCA for dimensionality reduction. It removes correlated variables and keeps the components that explain the most variance in the data.
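A minimal sketch with scikit-learn, assuming X stands in for the (n_samples, 12500) one-hot matrix from the question; 256 components is an arbitrary choice you would pick from the explained-variance curve:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the real (n_samples, 12500) one-hot encoded matrix.
X = np.random.randint(0, 2, size=(1000, 12500)).astype(np.float32)

pca = PCA(n_components=256)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                       # (1000, 256)
print(pca.explained_variance_ratio_.sum())   # fraction of variance retained
```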
Upvotes: 1