user554481

Reputation: 2115

Mahout text mining - most important words for a given singular value

Question: Is there an easy way to see the most important words associated with each singular value?

Background: I have applied Mahout’s singular value decomposition tool to a collection of news articles. The articles come from two topics: 1) sports, and 2) business. I would like to see the most important words associated with each singular value. For example, for one singular value I might expect the most prominent words to be sports terms: score, team, player, coach. For another singular value I might expect to see business terms: company, profit, revenue.

My Approach: I am considering making a file for each singular value, where -- for a given singular value -- the words are ordered in descending order of importance. This is just an idea. I'm open to suggestions.

Below is the command I have used so far to generate the singular values with Mahout:

/mahout-distribution-0.7/bin/mahout svd \
    -i /vectors/tfidf-vectors/ \
    -o /svd-values/ \
    --numRows 100 \
    --numCols 591 \
    -r 100

Upvotes: 0

Views: 419

Answers (1)

Sean Owen

Reputation: 66876

There's no way to do this directly in the project, and I don't know that code myself anyway. But I can tell you the general idea.

In the SVD you get a decomposition like A ~= U S V'. Let's say A is your document-term matrix, with one row per document and one column per word. Then the columns of A, and so the columns of V', correspond to words. Each row of V' is a right singular vector and pairs with one singular value in S. So you can read off directly from a row how that singular vector relates to the words: the entries with the largest absolute values mark the words that matter most.
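To make the read-off step concrete, here is a minimal sketch in plain Python. It is not Mahout code and does not use any Mahout API. It assumes you have already exported the rows of V' as lists of numbers and have a word list in the same column order as the term vectors; the names `top_words`, `vt_rows`, and `vocabulary` are hypothetical, and the numbers are made up for illustration. Ranking uses the absolute value because the sign of a singular vector is arbitrary.

```python
def top_words(vt_rows, vocabulary, k=5):
    """For each right singular vector (one row of V'), return the k words
    with the largest absolute weights, strongest first."""
    result = []
    for row in vt_rows:
        if len(row) != len(vocabulary):
            raise ValueError("each row of V' needs one entry per word")
        # Order the column indices by |weight|, largest first.
        order = sorted(range(len(row)), key=lambda j: abs(row[j]), reverse=True)
        result.append([(vocabulary[j], row[j]) for j in order[:k]])
    return result


# Toy data, invented for this example (assumption: not real Mahout output).
vocabulary = ["score", "team", "coach", "company", "profit", "revenue"]
vt_rows = [
    [0.60, 0.55, 0.50, 0.05, -0.02, 0.01],     # resembles a "sports" direction
    [0.03, -0.04, 0.02, -0.62, -0.58, -0.52],  # resembles a "business" direction
]

for i, ranked in enumerate(top_words(vt_rows, vocabulary, k=3)):
    print(i, [word for word, _ in ranked])
```

Writing each ranked list to its own file would give the one-file-per-singular-value layout described in the question.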

Upvotes: 1
