Jay

Reputation: 67

What's the best way to classify text data in ML?

Let's say I have a dataset consisting of a review column with exactly 100 words per review. Then it would be easy to train my model: I could simply tokenize the 100 words of each review, convert them into a numerical array, and feed that into a Sequential model with input_shape=(1,100). But in the real world, reviews are never the same size. If I use a function such as CountVectorizer, the structure of the sentence is not preserved, and one-hot encoding may not be efficient enough.
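To illustrate the word-order problem, here is a minimal sketch using scikit-learn's CountVectorizer (the reviews are made-up examples): two reviews containing the same words in a different order map to identical bag-of-words vectors.

    from sklearn.feature_extraction.text import CountVectorizer

    # Same words, opposite meanings.
    reviews = ["the movie was good not bad", "the movie was bad not good"]
    X = CountVectorizer().fit_transform(reviews)

    # Bag-of-words counts ignore word order, so both rows are identical.
    print((X[0].toarray() == X[1].toarray()).all())  # True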

So what is the proper way to preprocess this particular dataset so that I can feed it into a trainable NN?

Upvotes: 1

Views: 58

Answers (1)

Stefan

Reputation: 65

A common way to represent text as vectors is by utilizing word embeddings. The main idea is that you use a large text corpus to compute vector representations of all words occurring in that corpus. For each review, you could then run the following algorithm (sketched in code after the list) to compute its vector representation:

  1. For each word in the review, check whether a word embedding exists (in other words, whether that word occurred in the large training corpus); if it does, add its vector representation to the representation of the review.
  2. Once you have summed up the vector representations of all words, compute the average embedding by dividing the summed review vector by the number of words in the document. This results in the final vector representation for that document.
  3. This vector can now be fed into a trainable NN.
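Here is a minimal sketch of steps 1-2, assuming the pretrained embeddings are available as a plain word-to-vector dictionary (the words and random vectors below are placeholders for illustration):

    import numpy as np

    # Hypothetical pretrained embeddings: word -> 100-dimensional vector.
    # In practice these would come from a model trained on a large corpus.
    embeddings = {
        "great": np.random.rand(100),
        "movie": np.random.rand(100),
        "boring": np.random.rand(100),
    }

    def review_vector(review, embeddings, dim=100):
        """Average the embeddings of all known words in a review."""
        total = np.zeros(dim)
        count = 0
        for word in review.lower().split():
            if word in embeddings:      # step 1: skip words not in the corpus
                total += embeddings[word]
                count += 1
        return total / count if count else total  # step 2: average

    # Every review maps to a fixed-length vector, regardless of its length.
    print(review_vector("a great movie", embeddings).shape)  # (100,)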

Before performing steps 1-3, you could also apply further preprocessing: remove stop words such as "and", "or", etc. (they usually carry little meaning), convert words to lower case, and apply other standard NLP (natural language processing) techniques, all of which can affect the vector representation of the reviews. But the key idea is to sum up the word vectors of a review and use the averaged vector as the representation of the review. Thanks to the averaging, the length of the reviews is unimportant. Similarly, the dimensionality of the word vectors is fixed (100D, 200D, ...), so you can experiment to find the most suitable dimensionality.
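As a rough sketch of such a preprocessing pass (the stop-word list here is a tiny made-up sample, not a standard one):

    # Lowercase the text and drop stop words before computing embeddings.
    STOP_WORDS = {"and", "or", "the", "a", "is"}

    def preprocess(review):
        return [w for w in review.lower().split() if w not in STOP_WORDS]

    print(preprocess("The movie is great and fun"))  # ['movie', 'great', 'fun']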

Note that there are many different models available that compute word embeddings, so you could choose any of them. One that is nicely integrated into Python is word2vec (for example via the gensim library). And a state-of-the-art model that is currently being used by Google is called BERT.
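For instance, a minimal word2vec sketch with gensim could look like this (the toy corpus is made up, and gensim >= 4 is assumed, where the parameter is vector_size rather than the older size):

    from gensim.models import Word2Vec

    # Train a tiny word2vec model on a toy corpus of tokenized reviews.
    corpus = [
        ["great", "movie", "loved", "it"],
        ["boring", "movie", "fell", "asleep"],
    ]
    model = Word2Vec(corpus, vector_size=100, window=5, min_count=1)

    print(model.wv["movie"].shape)  # a 100-D embedding for "movie"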

Upvotes: 3
