NatalieK

Reputation: 3

How to fix (improve) a text classification model using word2vec

I'm a beginner in machine learning and neural networks, and I have a problem with text classification. I use an LSTM architecture built with the Keras library. My model reaches results of about 97% every time. I have a database of about 1 million records, of which 600k are positive and 400k are negative, with two labeled classes: 0 (negative) and 1 (positive). The database is split into a training set and a test set in an 80:20 ratio. For the network input, I use Word2Vec embeddings trained on PubMed articles. My network architecture:

from keras.models import Sequential
from keras.layers import LSTM, Dense, Activation

model = Sequential()
model.add(emb_layer)  # embedding layer built from the pretrained Word2Vec vectors
model.add(LSTM(64, dropout=0.5))
model.add(Dense(2))
model.add(Activation('softmax'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=50, batch_size=32)
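
For context, emb_layer above refers to a Keras Embedding layer seeded with the pretrained Word2Vec weights. A minimal sketch of one way to build it, assuming gensim is used to load the vectors (the file path and word_index are placeholders for the actual model file and a fitted tokenizer's vocabulary):

import numpy as np
from gensim.models import KeyedVectors
from keras.layers import Embedding

# Hypothetical path to the pretrained PubMed Word2Vec vectors.
w2v = KeyedVectors.load_word2vec_format('pubmed_w2v.bin', binary=True)

# word_index maps each token to an integer id (e.g. from a fitted Keras Tokenizer).
embedding_matrix = np.zeros((len(word_index) + 1, w2v.vector_size))
for word, i in word_index.items():
    if word in w2v:
        embedding_matrix[i] = w2v[word]

emb_layer = Embedding(input_dim=embedding_matrix.shape[0],
                      output_dim=w2v.vector_size,
                      weights=[embedding_matrix],
                      trainable=False)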

How can I fix (do better) my NN created model in this kind of text classification?

Upvotes: 0

Views: 710

Answers (1)

Szymon Płotka

Reputation: 36

The problem we are dealing with here is called overfitting. First of all, make sure your input data is properly cleaned; one of the principles of machine learning is "Garbage in, garbage out". Next, you should balance your dataset, for example to 400k positive and 400k negative records (see the sketch after the split example below). Then the dataset should be divided into training, validation, and test sets (60%:20%:20%), for example using the scikit-learn library, as in the following example:

from sklearn.model_selection import train_test_split

# First hold out 20% of the data for the test set...
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# ...then take 25% of the remaining 80% for validation, giving roughly 60:20:20.
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25)
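
For the cleaning and balancing steps mentioned above, a minimal sketch, assuming the raw texts are strings and X, y are NumPy arrays (all names here are illustrative):

import re
import numpy as np

def clean_text(text):
    # Rough cleaning: lowercase and drop everything except letters, digits and spaces.
    text = text.lower()
    return re.sub(r'[^a-z0-9\s]', ' ', text)

texts = [clean_text(t) for t in texts]

# Balance the classes by randomly undersampling the positive (majority) class.
rng = np.random.default_rng(42)
pos_idx = np.where(y == 1)[0]
neg_idx = np.where(y == 0)[0]
pos_idx = rng.choice(pos_idx, size=len(neg_idx), replace=False)

idx = np.concatenate([pos_idx, neg_idx])
rng.shuffle(idx)
X, y = X[idx], y[idx]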

Then I would use a different neural network architecture and try to optimize the hyperparameters. Personally, I would suggest either a 2-layer LSTM network or a combination of a convolutional and a recurrent neural network (the latter is faster to train and, according to articles I have read, gives better results).

1) 2-layer LSTM:

model = Sequential()
model.add(emb_layer)
# return_sequences=True passes the full output sequence on to the next LSTM layer.
model.add(LSTM(64, dropout=0.5, recurrent_dropout=0.5, return_sequences=True))
model.add(LSTM(64, dropout=0.5, recurrent_dropout=0.5))
model.add(Dense(2))
model.add(Activation('sigmoid'))

You can try using two layers with 64 hidden units each and adding the recurrent_dropout parameter. The main reason we use the sigmoid function is that its output lies in the range (0, 1). Since a probability can only take values between 0 and 1, sigmoid is a natural choice for models whose output should be interpreted as a probability.
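
To illustrate, a quick numeric check that the sigmoid always outputs values strictly between 0 and 1:

import numpy as np

def sigmoid(x):
    # Logistic function: maps any real number into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(-5), sigmoid(0), sigmoid(5))  # ~0.0067, 0.5, ~0.9933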

2) CNN + LSTM

from keras.layers import Conv1D, MaxPooling1D, Dropout

model = Sequential()
model.add(emb_layer)
model.add(Conv1D(32, 3, padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Dropout(0.5))
model.add(LSTM(32, dropout=0.5, recurrent_dropout=0.5, return_sequences=True))
model.add(LSTM(64, dropout=0.5, recurrent_dropout=0.5))
model.add(Dense(2))
model.add(Activation('sigmoid'))

You can try using a combination of a CNN and an RNN. In this architecture the model learns faster (up to 5 times faster).

Then, in both cases, you need to specify the optimizer and the loss function. A good optimizer for both cases is the Adam optimizer:

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In the last step, we validate our network on the validation set. In addition, we use a callback that stops training early when, for example, the monitored metric has not improved for 3 consecutive epochs.

from keras.callbacks import EarlyStopping

# Stop training when the validation loss has not improved for 3 epochs.
early_stopping = EarlyStopping(monitor='val_loss', patience=3)

history = model.fit(X_train, y_train, epochs=100, batch_size=32,
                    validation_data=(X_val, y_val), callbacks=[early_stopping])

We can also monitor overfitting with graphs, by plotting the training and validation loss over the epochs.
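
A minimal plotting sketch, assuming matplotlib is available and history is the object returned by the model.fit call above:

import matplotlib.pyplot as plt

# Diverging training and validation curves are a typical sign of overfitting.
plt.plot(history.history['loss'], label='training loss')
plt.plot(history.history['val_loss'], label='validation loss')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend()
plt.show()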

If you need further help, let me know in a comment.

Upvotes: 2
