Reputation: 682
I'm having trouble figuring out how one converts a sentence into a dense vector as input to a neural network, specifically to classify a sentence as 'liked' or 'not liked' given a training set.
I've had some luck with Support Vector Machines. Using NLTK and scikit-learn I worked out that, by default, NLTK's SklearnClassifier wrapper uses sklearn.feature_extraction.DictVectorizer. This seems to create a large matrix with as many columns as there are unique words in the dataset, holding 1.0 if that word appears in the sample and 0.0 if not.
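If I've understood it right, a quick sketch shows the idea (the feature dicts here are made up, but have the same {word: True} shape my prep_tweet() below produces):

from sklearn.feature_extraction import DictVectorizer

# Two toy feature dicts in the same {word: True} shape prep_tweet() returns
samples = [{'great': True, 'movie': True},
           {'boring': True, 'movie': True}]

v = DictVectorizer(sparse=False)
X = v.fit_transform(samples)
print(v.vocabulary_)  # {'boring': 0, 'great': 1, 'movie': 2}
print(X)              # [[0. 1. 1.]
                      #  [1. 0. 1.]]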
Here is the code I have already. It seems to work 'reasonably' well, but I'm looking at training some kind of recurrent or convolutional neural network to do the same sort of thing.
''' A small example of training a support vector machine
classifier to detect whether or not I like or dislike
a particular tweet '''
import sqlite3
import nltk
from sklearn.svm import LinearSVC
from nltk.classify.scikitlearn import SklearnClassifier
liked_ngrams = []
disliked_ngrams = []
# Limits on the dataset. Dividing into training vs test
pos_cutoff = 0
neg_cutoff = 0
# Prep a tweet from our db, setting the features
def prep_tweet(row):
    tokens = nltk.word_tokenize(row[0])
    tokens = [token.lower() for token in tokens if len(token) > 1]
    bi_tokens = nltk.bigrams(tokens)
    tri_tokens = nltk.trigrams(tokens)
    a = [(word, True) for word in tokens]
    #a += [(word, True) for word in bi_tokens]
    #a += [(word, True) for word in tri_tokens]
    return dict(a)
# test out classifier on some existing tweets
def test_classifier(classifier):
    conn = sqlite3.connect('fuchikoma.db')
    cursor = conn.cursor()
    # start with likes, skipping the rows that were used for training
    idx = 0
    total_correct = 0
    total_incorrect = 0
    for row in cursor.execute('SELECT * FROM likes ORDER BY date(date) DESC').fetchall():
        if idx >= pos_cutoff:
            try:
                a = prep_tweet(row)
                if len(a) > 0:
                    if classifier.classify(a) == 'pos':
                        total_correct += 1
                    else:
                        total_incorrect += 1
            except UnicodeDecodeError:
                pass
        idx += 1
    print("Checking positive Tweets")
    p = float(total_correct) / float(total_correct + total_incorrect) * 100.0
    print("Results: Correct: " + str(total_correct) + " Incorrect: " +
          str(total_incorrect) + " Percentage Correct: " + str(p) + "%")
    # now dislikes
    idx = 0
    total_correct = 0
    total_incorrect = 0
    for row in cursor.execute('SELECT * FROM dislikes ORDER BY date(date) DESC').fetchall():
        if idx >= neg_cutoff:
            try:
                a = prep_tweet(row)
                if len(a) > 0:
                    if classifier.classify(a) == 'neg':
                        total_correct += 1
                    else:
                        total_incorrect += 1
            except UnicodeDecodeError:
                pass
        idx += 1
    print("Checking negative Tweets")
    p = float(total_correct) / float(total_correct + total_incorrect) * 100.0
    print("Results: Correct: " + str(total_correct) + " Incorrect: " +
          str(total_incorrect) + " Percentage Correct: " + str(p) + "%")
# http://streamhacker.com/2010/10/25/training-binary-text-classifiers-nltk-trainer/
# http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/comment-page-2/
def train_classifier():
    global pos_cutoff, neg_cutoff  # without this, the assignments below create locals
    conn = sqlite3.connect('fuchikoma.db')
    cursor = conn.cursor()
    # start with likes
    for row in cursor.execute('SELECT * FROM likes ORDER BY date(date) DESC').fetchall():
        try:
            a = prep_tweet(row)
            if len(a) > 0:
                liked_ngrams.append(a)
        except UnicodeDecodeError:
            pass
    #from sklearn.feature_extraction import DictVectorizer
    #v = DictVectorizer(sparse=True)
    #X = v.fit_transform(liked_ngrams)
    #print(X)
    #print(v.inverse_transform(X))
    # now dislikes
    for row in cursor.execute('SELECT * FROM dislikes ORDER BY date(date) DESC').fetchall():
        try:
            a = prep_tweet(row)
            if len(a) > 0:
                disliked_ngrams.append(a)
        except UnicodeDecodeError:
            pass
    # Train on the first 75% of each class; test_classifier() uses the rest
    pos_cutoff = int(len(liked_ngrams) * 0.75)
    neg_cutoff = int(len(disliked_ngrams) * 0.75)
    training_set = [(feat, 'pos') for feat in liked_ngrams[:pos_cutoff]]
    training_set += [(feat, 'neg') for feat in disliked_ngrams[:neg_cutoff]]
    # Finally, train the classifier and return. SklearnClassifier vectorises
    # the feature dicts internally with sklearn.feature_extraction.DictVectorizer
    # (sparse by default)
    classif = SklearnClassifier(LinearSVC())
    classif.train(training_set)
    return classif
if __name__ == "__main__":
    test_classifier(train_classifier())
I've been pointed at http://www.aclweb.org/anthology/P14-1062 but it's a bit too hard for me to understand fully. I've seen plenty of neural nets and deep learning for images, but comparatively little on text. Can someone please point me at some easier, introductory work on this sort of thing? Cheers, Ben
Upvotes: 1
Views: 1246
Reputation: 1691
There are many ways:
1) Vector space model: once you have that high-dimensional bag-of-words vector, you can apply an SVD, e.g. with 300 components, and you will have a dense vector in 300 dimensions (see the first sketch after this list). Do check out this example, written by me for SemEval 2016.
2) A CNN with a word embedding layer (randomized or pretrained); check out Keras's docs and examples, and the second sketch after this list.
3) To get a full picture of the problem, check out these slides: [Neural Text Embeddings for Information Retrieval tutorial at WSDM 2017](http://www.slideshare.net/BhaskarMitra3/neural-text-embeddings-for-information-retrieval-wsdm-2017)
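For (1), here is a minimal sketch of the bag-of-words + SVD route in scikit-learn. The corpus is a made-up stand-in for your tweets; on real data you would keep something like the 300 components mentioned above:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

# Toy documents standing in for the tweets
docs = ["i love this film", "this film is terrible",
        "what a great movie", "awful boring movie"]

bow = CountVectorizer().fit_transform(docs)  # sparse term-count matrix
svd = TruncatedSVD(n_components=2)           # use ~300 on a real corpus
dense_vecs = svd.fit_transform(bow)          # dense, shape (4, 2)
print(dense_vecs)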
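For (2), a minimal sketch of a CNN over a word embedding layer, assuming Keras via tensorflow.keras; the vocabulary size and layer sizes here are arbitrary, and the input is assumed to be integer-encoded tweets padded to a fixed length:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

vocab_size = 5000  # assumed: keep the top-5000 words

model = Sequential([
    Embedding(vocab_size, 128),        # learned (or pretrained) word vectors
    Conv1D(64, 5, activation='relu'),  # n-gram-like filters over the sequence
    GlobalMaxPooling1D(),
    Dense(1, activation='sigmoid'),    # like / dislike
])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
# model.fit(x_train, y_train, ...) with x_train as padded integer sequences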
Upvotes: 1
Reputation: 8709
You might find this blog helpful:
http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/
Upvotes: 1