Reputation: 111
Suppose I have the following data set X with 2 features and labels Y.
X = [[0.3, 0.1], [0.2, 0.9], [0.4, 0.0]]
Y = [0, 1, 0]
# split into input (X) and output (Y) variables
X = dataset[:, 0:2]  # X features are the first two columns (indices 0 and 1)
Y = dataset[:, 2]
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM
model = Sequential()
model.add(Embedding(2, embedding_vector_length, input_length=max_review_length))
model.add(LSTM(2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, Y)
It works, but I want to know more about parameter_1, parameter_2 and parameter_3 that go in
Embedding(parameter_1, parameter_2, input_length=parameter_3)
P.S, I just put in random stuff and don't know what I am doing.
What would be the proper parameters to fill in Embedding() given the data set I described above?
Upvotes: 0
Views: 2282
Reputation: 11543
Alright, following the more precise questions in the comments, here is the explanation.
An embedding layer is usually used to embed words, so I will use a running example with words, but you can think of them as categorical features. The embedding layer is useful for representing words (categorical features) as vectors in a continuous vector space.
When you have a text, you tokenize it and assign a number to each word. The words then become categorical features labelled with an index. For example, the sentence "I embed stuff" becomes the list of categorical objects [2, 1, 3]
where a dictionary maps each index to a word: {1: "embed", 2: "I", 3: "stuff", 4: "some_other_words", 0: "<pad>"}
In a neural network, or any continuous mathematical framework, those discrete objects (= categories) are unordered: 2 > 1 makes no sense when you talk about words. They are not "numerical values", they are categories. So you want to turn them into numbers, that is, to embed them in a vector space.
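The tokenization step can be sketched in a few lines of plain Python. This is only an illustration: the helper name encode and the whitespace splitting are assumptions, not something Keras prescribes, and the vocabulary is the one from the example.

```python
# Example vocabulary from the text; index 0 is reserved for padding.
index_to_word = {0: "<pad>", 1: "embed", 2: "I", 3: "stuff", 4: "some_other_words"}

# Invert the dictionary so that words can be looked up by name.
word_to_index = {word: index for index, word in index_to_word.items()}


def encode(sentence):
    """Split on whitespace and replace each word with its integer index.

    Hypothetical helper for illustration; a real tokenizer would also
    handle punctuation, casing and unknown words.
    """
    return [word_to_index[word] for word in sentence.split()]


encoded = encode("I embed stuff")
print(encoded)  # [2, 1, 3]
```

The integers carry no order or magnitude; they are just labels that the embedding layer will later turn into vectors.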
This is precisely what the Embedding()
layer does: it maps every index to a vector. To do that, there are three main parameters to define:
input_dim = 5. The number of distinct indices in the vocabulary (5 in our dictionary, counting "<pad>"). It is called a "dimension" because, conceptually, Keras treats the index as a one-hot vector whose dimension is the number of distinct elements. For example, the word "stuff", which is index 3, becomes the 5-dimensional vector [0 0 0 1 0]
before being embedded. This is why your inputs should be integers: they are indices telling where the 1 is in the one-hot vector.
output_dim = 2. The size of the embedding vectors: our 5 words will live in a 2D space.
input_length = 3. The length of each input sequence: our example sentence has 3 words.
The reason why you usually have the embedding layer as the first layer is that it takes integer inputs, whereas the other layers of a neural network return real values, so it would not work after them.
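To make the one-hot view of input_dim concrete, here is a small sketch in plain Python (no Keras). The weight values are made up for illustration; in a real model they are learned during training. It shows that multiplying a one-hot vector by the weight matrix gives the same result as simply picking the row at that index, which is why the layer only needs integer indices as input.

```python
# Made-up embedding weights: input_dim = 5 rows, output_dim = 2 columns.
weights = [
    [0.0, 0.0],    # index 0, "<pad>"
    [-1.2, 0.3],   # index 1, "embed"
    [0.2, 0.4],    # index 2, "I"
    [-0.5, -0.8],  # index 3, "stuff"
    [0.9, 0.1],    # index 4, "some_other_words"
]


def one_hot(index, size):
    """Return a list of zeros with a single 1 at position `index`."""
    return [1 if position == index else 0 for position in range(size)]


def embed_via_one_hot(index):
    """Multiply the one-hot row vector by the weight matrix."""
    vector = one_hot(index, len(weights))
    output_dim = len(weights[0])
    return [
        sum(vector[row] * weights[row][col] for row in range(len(weights)))
        for col in range(output_dim)
    ]


def embed_via_lookup(index):
    """Pick the row of the weight matrix directly."""
    return weights[index]


print(one_hot(3, 5))  # [0, 0, 0, 1, 0]
print(embed_via_one_hot(3) == embed_via_lookup(3))  # True
```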
So to summarize, what comes into the layer is a sequence of indices: [2, 1, 3]
in our example. What comes out is the embedded vector corresponding to each index. This might be something like [[0.2, 0.4], [-1.2, 0.3], [-0.5, -0.8]].
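The input-to-output mapping for a whole batch can be sketched the same way, again in plain Python and with made-up weights. The names pad and embed are hypothetical helpers, and padding on the right is an arbitrary choice made for this sketch. Each sample is brought to input_length = 3 using the "<pad>" index, then every index is replaced by its 2D vector.

```python
INPUT_LENGTH = 3  # input_length from the example
PAD_INDEX = 0     # "<pad>" in the example vocabulary

# Made-up weights: one 2D vector (output_dim = 2) per index (input_dim = 5).
weights = [
    [0.0, 0.0],    # index 0, "<pad>"
    [-1.2, 0.3],   # index 1
    [0.2, 0.4],    # index 2
    [-0.5, -0.8],  # index 3
    [0.9, 0.1],    # index 4
]


def pad(indices, length=INPUT_LENGTH):
    """Right-pad with PAD_INDEX, or truncate, to a fixed length."""
    return (indices + [PAD_INDEX] * length)[:length]


def embed(indices):
    """Replace every index of the padded sequence by its weight row."""
    return [weights[index] for index in pad(indices)]


# A batch is a list of samples; each sample is a list of indices.
batch = [[2, 1, 3], [2, 1]]
embedded = [embed(sample) for sample in batch]

print(embedded[0])  # [[0.2, 0.4], [-1.2, 0.3], [-0.5, -0.8]]
print(embedded[1])  # [[0.2, 0.4], [-1.2, 0.3], [0.0, 0.0]]
```

The result has one 2D vector per position, so its shape is (number of samples, input_length, output_dim), here (2, 3, 2).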
Coming back to your example: the input should be a list of samples, each sample being a list of indices. There is no use in embedding features that are already real values. Such values already have a mathematical meaning that the model can work with directly, as opposed to categorical values.
Is it clearer now? :)
Upvotes: 1