Ilia Karmanov

Reputation: 215

Is Doc2Vec suited for Sentiment Analysis?

I have been reading more modern posts about sentiment classification (analysis) such as this.

Taking the IMDB dataset as an example, I get a similar accuracy using Doc2Vec (88%), but a far better result using a simple tfidf vectoriser with tri-grams for feature extraction (91%). I think this is similar to Table 2 in Mikolov's 2015 paper.

I thought that using a bigger data-set would change this. So I re-ran my experiment using a split of 1 million training and 1 million test reviews from here. Unfortunately, in that case my tfidf vectoriser feature extraction method increased to 93% but doc2vec fell to 85%.

I was wondering whether this is to be expected, and whether others also find tfidf superior to doc2vec even on a large corpus?

My data-cleaning is simple:

from bs4 import BeautifulSoup

def clean_review(review):
    # Strip any HTML markup from the raw review text
    text = BeautifulSoup(review, "lxml").get_text()
    # Pad punctuation with spaces so each mark is tokenised separately
    punctuation = """.,?!:;(){}[]"""
    for char in punctuation:
        text = text.replace(char, ' ' + char + ' ')
    # Lower-case and collapse whitespace
    return " ".join(text.lower().split()) + "\n"
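For example (an illustrative input, not an actual review from the dataset):

print(clean_review("<br />Great movie, loved it!"))
# prints: great movie , loved it !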

And I have tried vector sizes of 400 and 1200 (model_feat_size) for the Doc2Vec model:

model = Doc2Vec(min_count=2, window=10, size=model_feat_size, sample=1e-4, negative=5, workers=cores)
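Roughly, the training step looks like this (a minimal sketch assuming a recent gensim, where size and iter are named vector_size and epochs; raw_reviews and cores are placeholders):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# One TaggedDocument per cleaned review, tagged with its index
corpus = [TaggedDocument(words=clean_review(r).split(), tags=[i])
          for i, r in enumerate(raw_reviews)]

model = Doc2Vec(min_count=2, window=10, vector_size=model_feat_size,
                sample=1e-4, negative=5, workers=cores)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# The bulk-trained vector for document i is then model.dv[i]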

Whereas my tfidf vectoriser has 40,000 max features:

vectorizer = TfidfVectorizer(max_features = 40000, ngram_range = (1, 3), sublinear_tf = True)

For classification I experimented with a few linear methods, but found simple logistic regression to do OK...
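Concretely, the tfidf baseline pipeline is roughly this (train_texts, train_labels, etc. are placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

vectorizer = TfidfVectorizer(max_features=40000, ngram_range=(1, 3), sublinear_tf=True)
X_train = vectorizer.fit_transform(train_texts)  # fit the vocabulary on training text only
X_test = vectorizer.transform(test_texts)

clf = LogisticRegression()
clf.fit(X_train, train_labels)
print(clf.score(X_test, test_labels))  # ~0.91 on the IMDB split described above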

Upvotes: 1

Views: 2707

Answers (1)

gojomo

Reputation: 54153

The example code Mikolov once posted (https://groups.google.com/d/msg/word2vec-toolkit/Q49FIrNOQRo/J6KG8mUj45sJ) used options -cbow 0 -size 100 -window 10 -negative 5 -hs 0 -sample 1e-4 -threads 40 -binary 0 -iter 20 -min-count 1 -sentence-vectors 1 – which in gensim would be similar to dm=0, dbow_words=1, size=100, window=10, hs=0, negative=5, sample=1e-4, iter=20, min_count=1, workers=cores.
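Spelled out as a gensim call (a sketch using the parameter names of that era; recent gensim renames size to vector_size and iter to epochs, and cores is a placeholder):

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(dm=0, dbow_words=1, size=100, window=10, hs=0,
                negative=5, sample=1e-4, iter=20, min_count=1,
                workers=cores)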

My hunch is that optimal values might involve a smaller window and higher min_count, and maybe a size somewhere between 100 and 400, but it's been a while since I've run those experiments.

It can also sometimes help a little to re-infer vectors on the final model, using a larger-than-default number of inference passes, rather than re-using the bulk-trained vectors. Still, these may just converge on similar performance to tfidf – they're all dependent on the same word-features, and not very much data.
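That re-inference is a one-liner per document (a sketch; the pass count is the epochs argument of infer_vector in recent gensim, called steps in older versions):

# Re-infer a vector from the frozen model with extra inference passes
tokens = clean_review(review).split()
vec = model.infer_vector(tokens, epochs=20)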

Going to a semi-supervised approach, where some of the document-tags represent sentiments where known, sometimes also helps.
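For example, a labelled document can carry its sentiment as a second tag alongside its unique ID, so all known-sentiment documents of a class share one trained tag vector (a sketch; the tag strings are arbitrary):

from gensim.models.doc2vec import TaggedDocument

# Known-label document: unique ID plus a shared sentiment tag
doc = TaggedDocument(words=tokens, tags=['doc_42', 'SENT_POS'])
# Unlabelled document: unique ID only
doc = TaggedDocument(words=tokens, tags=['doc_43'])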

Upvotes: 3
