Oni

Reputation: 682

How does one create a dense vector from a sentence as input to a neural net?

I'm having trouble figuring out how one converts a sentence into a dense vector as input to a neural network, specifically to test whether or not a sentence is 'liked' or 'not liked' given a training set.

I've had some luck with Support Vector Machines. Using NLTK and scikit-learn I worked out that, by default, scikit-learn uses sklearn.feature_extraction.DictVectorizer. This creates a large matrix with one column per unique word in the dataset, holding 1.0 if that word appears in the sample and 0.0 if not.
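As a quick illustration (toy feature dicts of my own, not the actual tweet data), here is a minimal sketch of DictVectorizer producing exactly that one-column-per-unique-word matrix:

from sklearn.feature_extraction import DictVectorizer

# Two toy 'tweets' as feature dicts, the same shape prep_tweet() below returns
samples = [{'good': True, 'movie': True},
           {'bad': True, 'movie': True}]

v = DictVectorizer(sparse=False)
X = v.fit_transform(samples)
print(v.get_feature_names_out())  # ['bad' 'good' 'movie'] (get_feature_names() on older scikit-learn)
print(X)                          # [[0. 1. 1.]
                                  #  [1. 0. 1.]]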

Here is the code I have so far. It seems to work reasonably well, but I'm looking at training some kind of recurrent or convolutional neural network to do the same sort of thing.

''' A small example of training a support vector machine
classifier to detect whether I like or dislike
a particular tweet '''

import sqlite3
import nltk

from sklearn.svm import LinearSVC
from nltk.classify.scikitlearn import SklearnClassifier

liked_ngrams = []
disliked_ngrams = []

# Limits on the dataset. Dividing into training vs test
pos_cutoff = 0
neg_cutoff = 0

# Prep a tweet from our db, returning a feature dict of {token: True}
def prep_tweet(row):
  tokens = nltk.word_tokenize(row[0])
  # Lowercase everything and drop single-character tokens
  tokens = [token.lower() for token in tokens if len(token) > 1]
  bi_tokens = nltk.bigrams(tokens)
  tri_tokens = nltk.trigrams(tokens)
  a = [(word, True) for word in tokens]
  #a += [(word, True) for word in bi_tokens]
  #a += [(word, True) for word in tri_tokens]
  return dict(a)

# test out classifier on some existing tweets

def test_classifier(classifier):
  conn = sqlite3.connect('fuchikoma.db')
  cursor = conn.cursor()

  # start with likes
  idx = 0
  total_positive = 0
  total_negative = 0

  for row in cursor.execute('SELECT * FROM likes ORDER BY date(date) DESC').fetchall():
    if idx > pos_cutoff:

      test_set = []
      try:
        a = prep_tweet(row)
        if len(a) > 0:
          if classifier.classify(a) == 'neg':
            total_negative += 1
          else:
            total_positive += 1

      except UnicodeDecodeError:
        pass

    idx += 1

  print("Checking positive Tweets")
  p = float(total_positive) / float(total_positive + total_negative) * 100.0
  print ("Results: Positive: " + str(total_positive) + " Negative: " + str(total_negative) + " Percentage Correct: " + str(p) + "%")

  idx = 0
  total_positive = 0
  total_negative = 0

  for row in cursor.execute('SELECT * FROM dislikes ORDER BY date(date) DESC').fetchall():
    if idx > neg_cutoff:

      test_set = []
      try:
        a = prep_tweet(row)
        if len(a) > 0:
          if classifier.classify(a) == 'pos':
            total_negative += 1
          else:
            total_positive += 1

      except UnicodeDecodeError:
        pass

    idx += 1

  print("Checking negative Tweets")
  p = float(total_positive) / float(total_positive + total_negative) * 100.0
  print ("Results: Positive: " + str(total_positive) + " Negative: " + str(total_negative) + " Percentage Correct: " + str(p) + "%")


# http://streamhacker.com/2010/10/25/training-binary-text-classifiers-nltk-trainer/
# http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/comment-page-2/
def train_classifier():
  # pos_cutoff/neg_cutoff are assigned below; without this declaration they
  # would be locals, and the module-level values read by test_classifier
  # would stay 0 (so the test would include the training data)
  global pos_cutoff, neg_cutoff

  conn = sqlite3.connect('fuchikoma.db')
  cursor = conn.cursor()

  # start with likes
  for row in cursor.execute('SELECT * FROM likes ORDER BY date(date) DESC').fetchall():
    try:
      a = prep_tweet(row)
      if len(a) > 0:
        liked_ngrams.append(a)

    except UnicodeDecodeError:
       pass

  #from sklearn.feature_extraction import DictVectorizer
  #v = DictVectorizer(sparse=True)
  #X = v.fit_transform(liked_ngrams)
  #print(X)
  #print(v.inverse_transform(X))

  # now dislikes
  for row in cursor.execute('SELECT * FROM dislikes ORDER BY date(date) DESC').fetchall():
    try:
      a = prep_tweet(row)
      if len(a) > 0:
        disliked_ngrams.append(a)

    except UnicodeDecodeError:
       pass

  pos_cutoff = int(len(liked_ngrams)*0.75)
  neg_cutoff = int(len(disliked_ngrams)*0.75)

  training_set = [ (feat, 'pos') for feat in liked_ngrams[:pos_cutoff] ]
  training_set += [ (feat, 'neg') for feat in disliked_ngrams[:neg_cutoff]]

  # Finally, train the classifier and return.
  # By default this appears to vectorise the feature dicts using
  # sklearn.feature_extraction.DictVectorizer with a sparse feature set
  classif = SklearnClassifier(LinearSVC())
  classif.train(training_set)

  return classif

if __name__ == "__main__":
  test_classifier(train_classifier())

I've been pointed at http://www.aclweb.org/anthology/P14-1062 but it's a bit too hard for me to understand fully. I've seen plenty of neural nets and deep learning for images, but comparatively little on text. Can someone please point me at some easier, introductory work for this sort of thing? Cheers, Ben

Upvotes: 1

Views: 1246

Answers (2)

Steven Du

Reputation: 1691

There are many ways:

1) Vector space model: once you have that high-dimensional vector, you can apply an SVD, e.g. with 300 components, to get a dense vector in 300 dimensions (see the first sketch after this list). Do check out this example, written by me for SemEval 2016.

2) Use a CNN with a word-embedding layer (randomly initialised or pretrained); a minimal sketch follows this list. Check out Keras's docs and examples:

  1. cnn

  2. cnn lstm

  3. fasttext

  4. pretrained embeddings.

3) To get a full picture of the problem, please check out this slide deck: [Neural Text Embeddings for Information Retrieval tutorial at WSDM 2017](http://www.slideshare.net/BhaskarMitra3/neural-text-embeddings-for-information-retrieval-wsdm-2017)
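For 1), here is a minimal sketch of the SVD step (a toy corpus and a tiny n_components for illustration, not the answerer's SemEval code; with a real vocabulary you would use something like 300):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["I liked this tweet", "I did not like this tweet"]  # toy corpus

X = CountVectorizer().fit_transform(docs)  # sparse bag-of-words counts
svd = TruncatedSVD(n_components=2)         # 300 in practice; must be < n_features
dense = svd.fit_transform(X)               # dense, low-dimensional vectors
print(dense.shape)                         # (2, 2)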
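For 2), here is a minimal sketch of a CNN over an embedding layer, assuming a recent tf.keras; vocab_size and max_len are made-up values, and this is my own illustration rather than one of the linked Keras examples:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, Conv1D, GlobalMaxPooling1D, Dense

vocab_size = 5000  # assumed: number of distinct tokens kept
max_len = 40       # assumed: tweets padded/truncated to this length

model = Sequential([
    Input(shape=(max_len,)),           # integer token ids
    Embedding(vocab_size, 128),        # learned dense word vectors
    Conv1D(64, 5, activation='relu'),  # filters over 5-word windows
    GlobalMaxPooling1D(),
    Dense(1, activation='sigmoid'),    # 'liked' vs 'not liked'
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])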

Upvotes: 1
