Reputation: 215
I have been reading more modern posts about sentiment classification (analysis) such as this.
Taking the IMDB dataset as an example, I get a similar accuracy using Doc2Vec (88%), but a far better result using a simple tfidf vectoriser with tri-grams for feature extraction (91%). I think this is similar to Table 2 in Mikolov's 2015 paper.
I thought that this would change with a bigger dataset, so I re-ran my experiment using a split of 1 million training and 1 million test reviews from here. Unfortunately, in that case my tfidf vectoriser rose to 93% while Doc2Vec fell to 85%.
I was wondering whether this is to be expected, and whether others also find tfidf superior to Doc2Vec even for a large corpus?
My data-cleaning is simple:
from bs4 import BeautifulSoup

def clean_review(review):
    # Strip HTML markup, then pad punctuation with spaces so it tokenises separately
    temp = BeautifulSoup(review, "lxml").get_text()
    punctuation = """.,?!:;(){}[]"""
    for char in punctuation:
        temp = temp.replace(char, ' ' + char + ' ')
    # Lowercase and collapse whitespace
    words = " ".join(temp.lower().split()) + "\n"
    return words
And I have tried using 400 and 1200 features for the Doc2Vec model:
model = Doc2Vec(min_count=2, window=10, size=model_feat_size, sample=1e-4, negative=5, workers=cores)
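For completeness, I wire that into a classifier roughly like this (a sketch only; train_reviews, test_reviews, train_labels and test_labels are placeholder names, and the parameter names assume the pre-4.0 gensim API used above):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

# Tag each training review with its integer index
tagged = [TaggedDocument(words=clean_review(r).split(), tags=[i])
          for i, r in enumerate(train_reviews)]

model = Doc2Vec(min_count=2, window=10, size=model_feat_size,
                sample=1e-4, negative=5, workers=cores)
model.build_vocab(tagged)
model.train(tagged, total_examples=model.corpus_count, epochs=model.epochs)

# Bulk-trained vectors for the training docs, inferred vectors for the test docs
X_train = [model.docvecs[i] for i in range(len(tagged))]
X_test = [model.infer_vector(clean_review(r).split()) for r in test_reviews]

clf = LogisticRegression().fit(X_train, train_labels)
print(clf.score(X_test, test_labels))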
Whereas my tfidf vectoriser has 40,000 max features:
vectorizer = TfidfVectorizer(max_features = 40000, ngram_range = (1, 3), sublinear_tf = True)
For classification I experimented with a few linear methods, but found simple logistic regression to do OK...
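Concretely, the tfidf pipeline looks something like this (again a sketch, with the same placeholder names as above):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

vectorizer = TfidfVectorizer(max_features=40000, ngram_range=(1, 3), sublinear_tf=True)
X_train = vectorizer.fit_transform(clean_review(r) for r in train_reviews)
X_test = vectorizer.transform(clean_review(r) for r in test_reviews)

clf = LogisticRegression().fit(X_train, train_labels)
print(clf.score(X_test, test_labels))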
Upvotes: 1
Views: 2707
Reputation: 54153
The example code Mikolov once posted (https://groups.google.com/d/msg/word2vec-toolkit/Q49FIrNOQRo/J6KG8mUj45sJ) used the options -cbow 0 -size 100 -window 10 -negative 5 -hs 0 -sample 1e-4 -threads 40 -binary 0 -iter 20 -min-count 1 -sentence-vectors 1 – which in gensim would be similar to dm=0, dbow_words=1, size=100, window=10, hs=0, negative=5, sample=1e-4, iter=20, min_count=1, workers=cores.
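In gensim code, that configuration would look roughly like this (a sketch; tagged_docs is assumed to be your list of TaggedDocument objects and cores your worker count, with pre-4.0 parameter names to match your code):

from gensim.models.doc2vec import Doc2Vec

# Rough gensim equivalent of the word2vec.c options above
model = Doc2Vec(dm=0, dbow_words=1, size=100, window=10, hs=0,
                negative=5, sample=1e-4, iter=20, min_count=1, workers=cores)
model.build_vocab(tagged_docs)
model.train(tagged_docs, total_examples=model.corpus_count, epochs=model.epochs)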
My hunch is that optimal values might involve a smaller window and a higher min_count, and maybe a size somewhere between 100 and 400, but it's been a while since I've run those experiments.
It can also sometimes help a little to re-infer vectors on the final model, using a larger-than-the-default passes parameter, rather than re-using the bulk-trained vectors. Still, these may just converge on similar performance to Tfidf – they're all dependent on the same word-features, and not very much data.
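A minimal sketch of what I mean by re-inferring, assuming a pre-4.0 gensim model and that tagged is your list of TaggedDocument training docs:

# Re-infer a vector for each training doc on the fully-trained model, with more
# inference passes than the default (the `steps` argument in pre-4.0 gensim;
# newer versions call it `epochs`)
X_train = [model.infer_vector(doc.words, steps=20) for doc in tagged]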
Going to a semi-supervised approach, where some of the document-tags represent known sentiments, sometimes also helps.
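For example, a labeled review can carry a shared sentiment tag alongside its unique id, while unlabeled reviews keep just the id (a hypothetical helper, not from your code):

from gensim.models.doc2vec import TaggedDocument

# Every review gets a unique doc tag; reviews with a known label also get a
# shared sentiment tag, so the model learns a vector per sentiment as well
def tag_review(idx, words, label=None):
    tags = ['DOC_%d' % idx]
    if label is not None:
        tags.append('SENT_%s' % label)   # e.g. 'SENT_pos' or 'SENT_neg'
    return TaggedDocument(words=words, tags=tags)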
Upvotes: 3