3eyedRaven

Reputation: 111

LSTMs for binary classification in Keras?

Suppose I have the following dataset X with 2 features and labels Y.

X = [[0.3, 0.1], [0.2, 0.9], [0.4, 0.0]]

Y = [0, 1, 0]

    # split into input (X) and output (Y) variables
    X = dataset[:, 0:2]  # X features are the first two columns (indices 0 and 1)
    Y = dataset[:, 2]    # Y labels are the third column



    from keras.models import Sequential
    from keras.layers import Embedding, LSTM, Dense

    model = Sequential()
    # embedding_vector_length and max_review_length are arbitrary placeholders
    model.add(Embedding(2, embedding_vector_length, input_length=max_review_length))
    model.add(LSTM(2))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.fit(X, Y)

It works, but I wanted to know more about parameter_1, parameter_2, parameter_3 that go in

Embedding(parameter_1, parameter_2, input_length=parameter_3)

P.S. I just put in random stuff and don't know what I am doing.

What would be the proper parameters to fill in Embedding() given the data set I described above?

Upvotes: 0

Views: 2282

Answers (1)

Nassim Ben

Reputation: 11543

Alright, following the more precise questions in the comments, here is the explanation.

An embedding layer is usually used to embed words, so I will use words as a running example, but you can think of them as categorical features in general. The embedding layer is indeed useful for representing words (categorical features) as vectors in a continuous vector space.

When you have a text, you tokenize the words and assign each one a number. They then become categorical features labelled with an index. For example, the sentence "I embed stuff" becomes the list of categorical objects [2, 1, 3], where a dictionary maps each index to a word: {0: "<pad>", 1: "embed", 2: "I", 3: "stuff", 4: "some_other_words"}.
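In plain Python, that tokenization step might look like this (a minimal sketch; the dictionary is just the toy vocabulary above):

    # map each word of the toy vocabulary to an index
    word_to_index = {"<pad>": 0, "embed": 1, "I": 2, "stuff": 3, "some_other_words": 4}

    sentence = "I embed stuff"
    indices = [word_to_index[w] for w in sentence.split()]
    print(indices)  # [2, 1, 3]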

When you use a neural network or any continuous mathematical framework, those discrete objects (the categories) are unordered: 2 > 1 makes no sense when you are talking about words. They are not numerical values, they are categories. So you want to turn them into numbers, that is, to embed them in a vector space.

This is precisely what the Embedding() layer does: it maps each index to a vector. To do that, there are three main parameters to define (see the sketch after this list):

  1. How many indices you want to use in total. This is the number of words in your vocabulary, or the number of categories that the categorical feature you want to encode has. This is the input_dim parameter. In our little example, we have 5 words in the vocabulary (indices from 0 to 4), so input_dim = 5. The reason it is called a "dimension" is that under the hood, Keras transforms the index into a one-hot vector whose dimension equals the number of distinct elements. For example, the word "stuff", which has index 3, is transformed into the 5-dimensional vector [0, 0, 0, 1, 0] before being embedded. This is why your inputs should be integers: they are indices indicating where the 1 is in the one-hot vector.
  2. How big you want your output vectors to be. This is the size of the vector space your features will live in; the parameter is output_dim. If you don't have a lot of words in your vocabulary (different categories for your features), this number should be low. In our case we will set output_dim = 2, so our 5 words will live in a 2D space.
  3. As embedding layers are often the first layer of a neural network, you need to specify how many words each sample contains. This will be the input_length. Our sample was a 3-word phrase, so input_length = 3.
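Putting the three parameters together for the toy example (a minimal sketch using the vocabulary above; the printed shape is what Keras reports for this configuration):

    from keras.models import Sequential
    from keras.layers import Embedding

    # 5-word vocabulary, 2D embedding space, 3-word samples
    model = Sequential()
    model.add(Embedding(input_dim=5, output_dim=2, input_length=3))
    print(model.output_shape)  # (None, 3, 2): batch, 3 words, one 2D vector each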

The reason why you usually have the embedding layer as the first layer is that it takes integer inputs; layers inside a neural network return real values, so an embedding layer placed deeper in the network would not receive the integer indices it needs.

So to summarize, what goes into the layer is a sequence of indices, [2, 1, 3] in our example, and what comes out is the embedded vector corresponding to each index. That might be something like [[0.2, 0.4], [-1.2, 0.3], [-0.5, -0.8]].
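You can check that round trip directly, reusing the small model sketched after the list (the exact numbers depend on the random initialisation of the embedding weights):

    import numpy as np

    model.compile('rmsprop', 'mse')  # any optimizer/loss will do, we only predict
    vectors = model.predict(np.array([[2, 1, 3]]))  # the sentence "I embed stuff"
    print(vectors.shape)  # (1, 3, 2): 1 sample, 3 indices, one 2D vector per index
    print(vectors[0])     # three 2D vectors, e.g. [[0.2, 0.4], [-1.2, 0.3], ...]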

And to come back to your example: the input should be a list of samples, each sample being a list of indices. There is no point in embedding features that are already real values; those values already have a mathematical meaning the model can use, as opposed to categorical values.
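Applied to your dataset, that means you can drop the Embedding layer entirely and feed the real-valued features straight into the LSTM. Here is a hedged sketch, assuming each sample is treated as a sequence of a single timestep with 2 features (a modelling choice on my part, since your samples are not really sequences):

    import numpy as np
    from keras.models import Sequential
    from keras.layers import LSTM, Dense

    # Real-valued features: no Embedding needed
    X = np.array([[0.3, 0.1], [0.2, 0.9], [0.4, 0.0]])
    Y = np.array([0, 1, 0])

    # LSTM expects 3D input (samples, timesteps, features),
    # so reshape each sample into a 1-timestep sequence of 2 features
    X = X.reshape((3, 1, 2))

    model = Sequential()
    model.add(LSTM(2, input_shape=(1, 2)))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.fit(X, Y, epochs=10)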

Is it clearer now? :)

Upvotes: 1
