Gabriel

Reputation: 11

Embedding in TensorFlow Functional API with a 200,000-word dictionary

I have checked several questions on Stack Overflow and tutorials about Keras and TensorFlow embedding, but I have found no answer that works for me. Let me explain.

I have a dictionary of 200,000 words, with 10,376 unique "words". They represent cellular device IDs (IMEI). In this particular instance, I want to process them using the Keras Functional API and then, once I solve this, eventually merge them with the numerical data.

But I can't get past the first part, which is the embedding.

Here is the code:

#example of device
0         jg4M/taYRc2cBJPGa8c8vw==
1         jg4M/taYRc2cBJPGa8c8vw==
2         jg4M/taYRc2cBJPGa8c8vw==
3         chIM3a44QxatbmmjyBFGDQ==
4         PdhyfpkIT8Weslb54thwuQ==
5         lrDcRnK7RtKkvaqaYjliBQ==

#length of the device list
device_len = len(device)
device_len
200000
#unique devices inside the 200000
top_words = len(np.unique(device))
top_words
10376

#keras encoded
encoded_docs = [one_hot(d, top_words) for d in device]

#max length of the vector for each word
max_length = 2
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)

[[10269  9475]
 [10269  9475]
 [10269  9475]
 ...
 [ 1340  2630]
 [ 7270     0]
 [ 2364  9298]]

#converted to tensors
padded_docs = tf.convert_to_tensor(padded_docs)
sess = tf.InteractiveSession()  
print(padded_docs.eval())
sess.close()

#here is the networks
top_words = 10376
embedding_vector_length = 2
x = Embedding(top_words, embedding_vector_length)(padded_docs)
x = Dense(2, activation='sigmoid')(x)
modelx = Model(inputs=padded_docs, outputs = x)
ValueError: Input tensors to a Model must come from `keras.layers.Input`. Received: Tensor("Const:0", shape=(200000, 2), dtype=int32) (missing previous layer metadata).

I have checked similar questions and answers, but I can't find anything that works for me.

If someone can help me, it would be greatly appreciated.

Thank you very much indeed.

Upvotes: 0

Views: 472

Answers (2)

Daniel Möller

Reputation: 86600

You need an Input for your model. padded_docs is not a tensor, it's "data".

 from keras.layers import Input

 inputs = Input((doc_length,))
 x = Embedding(top_words, embedding_vector_length)(inputs)
 x = Dense(2, activation='sigmoid')(x)

 modelx = Model(inputs=inputs, outputs = x)

Also, padded_docs needs to be made of "integers", not of one-hot encodings. The Embedding layer expects integers.

It's important to note that you will not pass it as a tensor, but as a regular numpy array, when training with model.fit.

So you need to remove the one_hot and convert_to_tensor parts.
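For the integer encoding, one option (a sketch, not the only way) is to build an explicit vocabulary mapping each unique device ID to a stable index, instead of hashing with one_hot, which can produce collisions. The device strings below are just the example data from the question:

```python
# Sketch: map each unique device-ID string to a stable integer index.
# Index 0 is reserved for padding; real IDs get 1..N.
devices = [
    "jg4M/taYRc2cBJPGa8c8vw==",
    "jg4M/taYRc2cBJPGa8c8vw==",
    "chIM3a44QxatbmmjyBFGDQ==",
    "PdhyfpkIT8Weslb54thwuQ==",
]

vocab = {d: i + 1 for i, d in enumerate(sorted(set(devices)))}
encoded = [[vocab[d]] for d in devices]  # one integer per device

print(len(vocab))   # number of unique IDs
print(encoded)
```

This guarantees every distinct device ID gets its own index, which one_hot (a hashing trick) does not.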

Then you will do a model.fit(padded_docs, whatever_outputs, .....etc....)
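Putting those pieces together, a minimal end-to-end sketch might look like the following. The data and labels here are random placeholders just to make it runnable, and the Flatten layer is an assumption (Dense applied directly to the 3D Embedding output would produce one output per timestep):

```python
import numpy as np
from tensorflow.keras.layers import Input, Embedding, Flatten, Dense
from tensorflow.keras.models import Model

top_words = 10376          # number of unique device IDs
max_length = 2             # padded sequence length
embedding_vector_length = 2

# padded_docs stays a plain numpy array of integer indices (no tensors).
padded_docs = np.random.randint(0, top_words, size=(100, max_length))
labels = np.random.randint(0, 2, size=(100, 1))  # hypothetical targets

inputs = Input((max_length,))
x = Embedding(top_words, embedding_vector_length)(inputs)
x = Flatten()(x)  # collapse (max_length, emb) so Dense sees one vector
outputs = Dense(1, activation='sigmoid')(x)

modelx = Model(inputs=inputs, outputs=outputs)
modelx.compile(optimizer='adam', loss='binary_crossentropy')
modelx.fit(padded_docs, labels, epochs=1, verbose=0)
```

The key point is that only Input objects go into Model; the numpy data only appears at fit time.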

Upvotes: 1

The Guy with The Hat

Reputation: 11132

When creating a Model, the input should be an Input layer, not a tensor.

input = keras.layers.Input((max_length,))
x = Embedding(top_words, embedding_vector_length)(input)
x = Dense(2, activation='sigmoid')(x)
modelx = Model(inputs=input, outputs=x)

Upvotes: 0
