Reputation: 43
First of all, I am not a programmer, but I am teaching myself deep learning in order to undertake a real project with my own dataset. My situation can be broken down as follows:
I am trying to build a multiclass text classification model. I have a corpus of 1000 examples, and each example has exactly one of 4 possible labels (A1, A2, B1, B2); the labels are mutually exclusive. The examples are stored as separate .txt files in separate folders.
After a lot of effort and some man tears I managed to put together this code:
import os
import string
import keras
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
import re
import numpy as np
import tensorflow as tf
from numpy import array
from sklearn.model_selection import KFold
from numpy.random import seed
for label in ["A1","A2","B1","B2"]:
for fname in os.listdir(directory):
if fname[-4:]==".txt":
f = open(os.path.join(directory, fname),encoding="cp1252")
if label == 'A1':
elif label=="A2":
elif label=="B1":
print("Corpus Length", len( root), "\n")
print("The total number of reviews in the train dataset is", len(texts),"\n")
stops = set(stopwords.words("english"))
print("The number of stopwords used in the beginning: ", len(stops),"\n")
print("The words removed from the corpus will be",stops,"\n")
## This adds new words or terms from the words_to_add list to the stopwords
## (stops is a set, so use update() rather than append())
stops.update(words_to_add)
## This removes the words or terms in the words_to_remove list,
## so that they are no longer included in the stopwords
stops.difference_update(words_to_remove)
texts=[[w.lower() for w in word_tokenize("".join(str(review))) if w not in stops and w not in string.punctuation and len(w)>2 and w.isalpha()]for review in texts ]
print("customized stopwords: ", stops,"\n")
print("count of customized stopwords",len(stops),"\n")
#tokenizing the raw data
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
maxlen = 50
training_samples = 200
validation_samples = 10000
max_words = 10000
print("Sequence of tokens: ",tokens,"\n")
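tokenizer = Tokenizer(num_words=max_words)
# the tokenizer has to be fitted on the corpus before it can encode anything
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)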
print("Tokens:", sequences,"\n")
word_index = tokenizer.word_index
print("Unique tokens:",word_index,"\n")
print(' %s unique tokens in total.' % len(word_index,),"\n")
print("Unique tokens: ", word_index,"\n")
print("Dictionary of words and their count:", tokenizer.word_counts,"\n" )
print(" Number of docs/seqs used to fit the Tokenizer:", tokenizer.document_count,"\n")
print("Dictionary of words and how many documents each appeared in:",tokenizer.word_docs,"\n")
data = pad_sequences(sequences, maxlen=maxlen, padding="post")
print("padded data","\n")
#checking the encoding with a new document
text2="I like to study english in the morning and play games in the afternoon"
text2=[w.lower() for w in word_tokenize("".join(str(text2))) if w not in stops and w not in string.punctuation
and len(w)>2 and w.isalpha()]
sequences = tokenizer.texts_to_sequences([text2])
text2 = pad_sequences(sequences, maxlen=maxlen, padding="post")
print("padded text2","\n")
labels = np.asarray(labels)
print('Shape of data tensor:', data.shape,"\n")
print('Shape of label tensor:', labels.shape,"\n")
kf = KFold(n_splits=4, random_state=None, shuffle=True)
KFold(n_splits=4, random_state=None, shuffle=True)
for train_index, test_index in kf.split(data):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = data[train_index], data[test_index]
    y_train, y_test = labels[train_index], labels[test_index]
#Pretrained embedding
glove_dir = r'D:\glove'  # raw string, so the backslash is not read as an escape
embeddings_index = {}
f = open(os.path.join(glove_dir, 'glove.6B.100d.txt'),encoding="utf-8")
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
print("Found %s word vectors from GloVe." % len(embeddings_index))
#Preparing the Glove word-embeddings matrix to pass to the embedding layer(max_words, embedding_dim)
embedding_dim = 100
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    if i < max_words:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
# define vocabulary size (largest integer value)
# define model
from keras.models import Sequential
from keras.layers import Embedding,Flatten,Dense
from keras import layers
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D
model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))#vocabulary size + the size of glove version +max len of input documents.
model.add(Conv1D(filters=32, kernel_size=8, activation='relu'))
model.add(Dense(10, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
#Loading pretrained word embeddings and Freezing the Embedding layer
model.layers[0].trainable = False
# compile network
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(X_train, y_train, epochs=6,verbose=2)
# evaluate
loss, acc = model.evaluate(X_test, y_test, verbose=0)
print('Test Accuracy: %f' % (acc*100))
However, I am getting this error:
Traceback (most recent call last):
File "D:/", line 177, in <module>
model.fit(X_train, y_train, epochs=6,verbose=2)
File "D:\ProgramData\Miniconda3\envs\Env_DLexp1\lib\site-packages\keras\engine\", line 1154, in fit
File "D:\ProgramData\Miniconda3\envs\Env_DLexp1\lib\site-packages\keras\engine\", line 642, in _standardize_user_data
y, self._feed_loss_fns, feed_output_shapes)
File "D:\ProgramData\Miniconda3\envs\Env_DLexp1\lib\site-packages\keras\engine\", line 284, in check_loss_and_target_compatibility
' while using as loss `categorical_crossentropy`. '
ValueError: You are passing a target array of shape (3, 1) while using as loss `categorical_crossentropy`. `categorical_crossentropy` expects targets to be binary matrices (1s and 0s) of shape (samples, classes). If your targets are integer classes, you can convert them to the expected format via:
from keras.utils import to_categorical
y_binary = to_categorical(y_int)
Alternatively, you can use the loss function `sparse_categorical_crossentropy` instead, which does expect integer targets.
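To make the two target formats in that message concrete, here is a minimal pure-Python sketch of the difference between integer class targets and the "binary matrix" layout. The helper to_one_hot is a hypothetical stand-in written for illustration (it is not the Keras function), and the three example labels are made up.

```python
def to_one_hot(class_ids, num_classes):
    """Turn integer class ids into rows of 0s and 1s, one row per sample.

    Hypothetical helper: it only mimics the (samples, classes) layout that
    categorical_crossentropy expects; it is not keras.utils.to_categorical.
    """
    rows = []
    for class_id in class_ids:
        if not 0 <= class_id < num_classes:
            raise ValueError(f"class id {class_id} is outside 0..{num_classes - 1}")
        row = [0] * num_classes
        row[class_id] = 1  # a single 1 marks the true class
        rows.append(row)
    return rows


# Integer targets: one number per sample (what sparse_categorical_crossentropy expects).
y_int = [0, 3, 1]

# Binary-matrix targets: shape (samples, classes) = (3, 4).
y_binary = to_one_hot(y_int, num_classes=4)
print(y_binary)  # [[1, 0, 0, 0], [0, 0, 0, 1], [0, 1, 0, 0]]
```

Either format can describe the same 4-class labels; what matters is that the loss and the shape of the last layer agree with the format you pick.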
I tried everything the error message suggests, but to no avail. After some research I came to the conclusion that the model is not trying to predict multiple classes, which is why the categorical_crossentropy loss is not being accepted. I then realized that if I change it to binary cross-entropy the error goes away, which really confirms that this is not working as a multiclass classification model.
What can I do to adjust my code to make it work as intended? Am I S*it out of luck and have to start a whole different project?
Any type of guidance will be of immense help for me and my mental health.
Upvotes: 0
Views: 2309
Reputation: 56377
You should make two changes. First, the number of neurons in the output layer of your network should match the number of classes, and that layer should use the softmax activation:
model.add(Dense(4, activation='softmax'))
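As a rough illustration of why four softmax units fit this problem, the sketch below computes softmax by hand in plain Python. The raw scores (logits) are made-up numbers, and CLASS_NAMES assumes the classes are ordered A1, A2, B1, B2; neither comes from the question.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


CLASS_NAMES = ["A1", "A2", "B1", "B2"]  # assumed class order
logits = [2.0, 0.5, -1.0, 0.1]          # made-up outputs of a 4-unit layer

probs = softmax(logits)
predicted = CLASS_NAMES[probs.index(max(probs))]

print(probs)      # four probabilities, one per class
print(predicted)  # the class with the highest probability
```

A single sigmoid unit can only produce one number between 0 and 1, which is enough for two classes but not for four mutually exclusive ones.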
Then you should use the sparse_categorical_crossentropy loss, as you are not one-hot encoding the labels:
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
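For intuition, this is roughly what that loss computes, written as a simplified plain-Python sketch rather than the actual Keras implementation. The function below, the eps clipping value and the example probabilities are all assumptions made for illustration.

```python
import math


def sparse_categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-probability assigned to the true class.

    y_true: integer class ids, one per sample.
    y_prob: one list of class probabilities per sample (e.g. softmax output).
    """
    losses = []
    for class_id, probs in zip(y_true, y_prob):
        # Clip to avoid log(0); eps is an illustrative choice.
        p = min(max(probs[class_id], eps), 1.0 - eps)
        losses.append(-math.log(p))
    return sum(losses) / len(losses)


y_true = [0, 2]                      # integer labels, no one-hot encoding
y_prob = [
    [0.70, 0.10, 0.10, 0.10],        # fairly confident and correct
    [0.25, 0.25, 0.25, 0.25],        # no idea: uniform over 4 classes
]

loss = sparse_categorical_crossentropy(y_true, y_prob)
print(loss)
```

The integer label is used directly as an index into the predicted probabilities, which is why the labels do not need to be one-hot encoded for this loss.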
Then the model should be able to train without errors.
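One thing worth checking before training: with this loss the labels have to be integers from 0 to 3, one per example. A minimal sketch of such an encoding follows, assuming the folder names are used as label names; LABEL_TO_ID, encode_labels, check_targets and the sample list are hypothetical and not taken from the question's code.

```python
# Assumed mapping from folder/label name to integer class id.
LABEL_TO_ID = {"A1": 0, "A2": 1, "B1": 2, "B2": 3}


def encode_labels(label_names):
    """Map label names such as "B1" to integer class ids."""
    return [LABEL_TO_ID[name] for name in label_names]


def check_targets(label_ids, num_output_units):
    """Raise if any label cannot be matched to an output unit."""
    bad = sorted({y for y in label_ids if not 0 <= y < num_output_units})
    if bad:
        raise ValueError(f"labels {bad} are outside 0..{num_output_units - 1}")
    return True


folder_names = ["A1", "B2", "A2", "B1", "A1"]  # made-up sample of labels
labels = encode_labels(folder_names)
check_targets(labels, num_output_units=4)      # 4 = units in the softmax layer
print(labels)  # [0, 3, 1, 2, 0]
```

If the labels start at 1 instead of 0, or are stored as strings, a check like this fails early instead of producing a confusing error during training.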
Upvotes: 1