Reputation: 11
I am learning how to apply word2vec for document classification, but I am struggling with the following two issues:
My dataset consists of users' comments; some comments contain only one word (e.g. a nonsense token like "husgmabb", or an HTTP link, which I simply replace with "URL"). Can I apply word2vec to a dataset that contains such one-word comments?
My dataset is labelled "spam" or "ham"; I want to represent each document as a vector in an embedding space, then train a neural network on those vectors. Is that a proper approach for document classification?
Can anyone give me some explanation? I am new to text mining. Many thanks!
Upvotes: 0
Views: 232
Reputation: 54153
Word-vectors, alone, aren't enough to do document classification. They might help in certain approaches.
Is your main goal to "learn word2vec" or to "do effective document classification"? Because if it's the latter, you should seek out online classes/tutorials on document classification – such as those which teach the use of scikit-learn algorithms – and follow those. You'd only then get into word2vec later, if necessary for certain problems.
For example, most introductory spam-classification algorithms don't use word2vec, so adding that as an extra thing to learn, when new to text-based learning, is an added complication. (Still good to learn eventually, but it's best to start simple.)
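As a sketch of that simpler starting point, here is a minimal bag-of-words spam classifier using scikit-learn, with no word2vec involved. The comments and labels below are toy data invented for illustration:

```python
# Minimal spam/ham classifier: bag-of-words counts + naive Bayes.
# Toy data only; a real corpus would have many more labelled comments.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

comments = [
    "win a free prize now click URL",
    "limited offer claim your reward URL",
    "thanks for the helpful explanation",
    "I agree with your point about training",
]
labels = ["spam", "spam", "ham", "ham"]

# Pipeline: turn each comment into word counts, then fit a classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(comments, labels)

print(model.predict(["free prize URL"]))  # → ['spam'] on this toy data
```

A pipeline like this is the usual baseline to beat before adding embeddings or a neural network.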
One-word texts may be garbage, or uninterpretable without much more context. (There may be something wrong in the corpus-construction if you have single-word docs – and if you were trying to solve a real community/business problem, the right thing to do might be to go back to the original data source and try to extract better examples with more context – like the speaker of the text, or any messages it was in-reply-to, etc.)
Can you do anything useful with a single nonsense word like "husgmabb"? Text understanding software generally does worse than humans who are familiar with the problem domain, so if you can't interpret "husgmabb", neither will an algorithm. (However, if there are enough examples in the training data of a mystery word that a person would understand it, if they had time to read them all, then perhaps an algorithm can also come to some understanding.)
So, if that "one word" also appears in many other examples, and those other examples help flesh out what it means, then there may be some predictive power when it appears alone. But it depends on lots of details you'd have to share by posing more specific questions that explain your goals, what you've tried, and how any existing code isn't doing what you'd expect.
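If you do later move to word2vec, one common (if crude) way to get a document vector is to average the vectors of the document's known words, which also makes the one-word problem concrete: an all-unknown comment like "husgmabb" yields nothing usable. A sketch with hypothetical 2-dimensional toy vectors standing in for a trained model:

```python
import numpy as np

# Hypothetical toy word vectors; in practice these would come from a
# trained word2vec model (e.g. gensim's KeyedVectors), not be hand-set.
word_vectors = {
    "free":   np.array([0.9, 0.1]),
    "prize":  np.array([0.8, 0.2]),
    "thanks": np.array([0.1, 0.9]),
}

def doc_vector(tokens, vectors, dim=2):
    """Average the vectors of known tokens; zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # A single out-of-vocabulary word ("husgmabb") gives no signal.
        return np.zeros(dim)
    return np.mean(known, axis=0)

print(doc_vector(["free", "prize"], word_vectors))  # → [0.85 0.15]
print(doc_vector(["husgmabb"], word_vectors))       # → [0. 0.]
```

The averaged vectors could then feed a downstream classifier, but note how the all-zero vector for unknown single words carries no information at all.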
Upvotes: 1