Reputation: 1419

How to use keras embedding layer with 3D tensor input?

I am facing difficulty in using Keras embedding layer with one hot encoding of my input data.

Following is the toy code.

Import packages

from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers.embeddings import Embedding
from keras.optimizers import Adam
import matplotlib.pyplot as plt
import numpy as np
import openpyxl
import pandas as pd
from keras.callbacks import ModelCheckpoint
from keras.callbacks import ReduceLROnPlateau

The input data is text based as follows.

Train and Test data

X_train_orignal= np.array(['OC(=O)C1=C(Cl)C=CC=C1Cl', 'OC(=O)C1=C(Cl)C=C(Cl)C=C1Cl',
       'OC(=O)C1=CC=CC(=C1Cl)Cl', 'OC(=O)C1=CC(=CC=C1Cl)Cl',
       'OC1=C(C=C(C=C1)[N+]([O-])=O)[N+]([O-])=O'])

X_test_orignal=np.array(['OC(=O)C1=CC=C(Cl)C=C1Cl', 'CCOC(N)=O',
       'OC1=C(Cl)C(=C(Cl)C=C1Cl)Cl'])

Y_train=np.array(([[2.33],
       [2.59],
       [2.59],
       [2.54],
       [4.06]]))

Y_test=np.array([[2.20],
   [2.81],
   [2.00]])

Creating dictionaries

Now i create two dictionaries, characters to index vice. The unique character number is stored in len(charset) and maximum length of the string along with 5 additional characters is stored in embed. The start of each string will be padded with ! and end will be E.

charset = set("".join(list(X_train_orignal))+"!E")
char_to_int = dict((c,i) for i,c in enumerate(charset))
int_to_char = dict((i,c) for i,c in enumerate(charset))
embed = max([len(smile) for smile in X_train_orignal]) + 5
print (str(charset))
print(len(charset), embed)

One hot encoding

I convert all the train data into one hot encoding as follows.

def vectorize(smiles):
        one_hot =  np.zeros((smiles.shape[0], embed , len(charset)),dtype=np.int8)
        for i,smile in enumerate(smiles):
            #encode the startchar
            one_hot[i,0,char_to_int["!"]] = 1
            #encode the rest of the chars
            for j,c in enumerate(smile):
                one_hot[i,j+1,char_to_int[c]] = 1
            #Encode endchar
            one_hot[i,len(smile)+1:,char_to_int["E"]] = 1

        return one_hot[:,0:-1,:]

X_train = vectorize(X_train_orignal)
print(X_train.shape)
X_test = vectorize(X_test_orignal)
print(X_test.shape)

When it converts the input train data into one hot encoding, the shape of the one hot encoded data becomes (5, 44, 14) for train and (3, 44, 14) for test. For train, there are 5 example, 0-44 is the maximum length and 14 are the unique characters. The examples for which there are less number of characters, are padded with E till the maximum length.

Verifying the correct padding Following is the code to verify if we have done the padding rightly.

mol_str_train=[]
mol_str_test=[]
for x in range(5):

    mol_str_train.append("".join([int_to_char[idx] for idx in np.argmax(X_train[x,:,:], axis=1)]))

for x in range(3):
    mol_str_test.append("".join([int_to_char[idx] for idx in np.argmax(X_test[x,:,:], axis=1)]))

and let's see, how the train set looks like.

mol_str_train

['!OC(=O)C1=C(Cl)C=CC=C1ClEEEEEEEEEEEEEEEEEEEE',
 '!OC(=O)C1=C(Cl)C=C(Cl)C=C1ClEEEEEEEEEEEEEEEE',
 '!OC(=O)C1=CC=CC(=C1Cl)ClEEEEEEEEEEEEEEEEEEEE',
 '!OC(=O)C1=CC(=CC=C1Cl)ClEEEEEEEEEEEEEEEEEEEE',
 '!OC1=C(C=C(C=C1)[N+]([O-])=O)[N+]([O-])=OEEE']

Now is the time to build model.

Model

model = Sequential()
model.add(Embedding(len(charset), 10, input_length=embed))
model.add(Flatten())
model.add(Dense(1, activation='linear'))

def coeff_determination(y_true, y_pred):
    from keras import backend as K
    SS_res =  K.sum(K.square( y_true-y_pred ))
    SS_tot = K.sum(K.square( y_true - K.mean(y_true) ) )
    return ( 1 - SS_res/(SS_tot + K.epsilon()) )

def get_lr_metric(optimizer):
    def lr(y_true, y_pred):
        return optimizer.lr
    return lr


optimizer = Adam(lr=0.00025)
lr_metric = get_lr_metric(optimizer)
model.compile(loss="mse", optimizer=optimizer, metrics=[coeff_determination, lr_metric])



callbacks_list = [
    ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5, min_lr=1e-15, verbose=1, mode='auto',cooldown=0),
    ModelCheckpoint(filepath="weights.best.hdf5", monitor='val_loss', save_best_only=True, verbose=1, mode='auto')]


history =model.fit(x=X_train, y=Y_train,
                              batch_size=1,
                              epochs=10,
                              validation_data=(X_test,Y_test),
                              callbacks=callbacks_list)

Error

ValueError: Error when checking input: expected embedding_3_input to have 2 dimensions, but got array with shape (5, 44, 14)

The embedding layer expects two dimensional array. How can I deal with this issue so that it can accept the one hot vector encoded data.

All the above code can be run.

Upvotes: 3

How to use keras embedding layer with 3D tensor input?

Answers (2)

Related Questions