LivingRobot

Reputation: 913

Visualizing KMeans Clustering with Many Inputs

I'm totally new to machine learning (and full disclosure: this is for school) and am trying to wrap my head around KMeans Clustering and its implementation. I understand the gist of the algorithm and have implemented it in Java but I'm a little confused as to how to use it on a complex dataset.

For example, I have 3 folders, A, B and C, each one containing 8 text files (so 24 text files altogether). I want to verify that I've implemented KMeans correctly by having the algorithm cluster these 24 documents into 3 clusters based on their word usage.

To this end, I've created a word-frequency matrix and applied tf-idf weighting to it, producing a 24 x 2367 sparse matrix (24 documents and 2367 words/n-grams in total). Then I run my KMeans clustering algorithm on this tf-idf matrix, but I'm not getting good results.
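Roughly, the pipeline looks like this (sketched with scikit-learn just for illustration; my actual implementation is in Java, and the folder layout here is assumed):

    # Rough scikit-learn equivalent of the pipeline, for illustration only
    # (the real implementation is in Java; the folder layout is assumed).
    from pathlib import Path
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    # Read the 24 text files from folders A, B and C
    documents = [path.read_text(encoding="utf-8")
                 for folder in ("A", "B", "C")
                 for path in sorted(Path(folder).glob("*.txt"))]

    # Build the sparse tf-idf matrix (24 x 2367 for this corpus)
    tfidf = TfidfVectorizer().fit_transform(documents)

    # Cluster into 3 groups and print one cluster id per document
    print(KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(tfidf))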

To try to debug this, I want to visualize my tf-idf matrix and the centroids I get as output, but I don't quite understand how one would visualize a 24 x 2367 matrix. I've also saved this matrix to a .csv file and want to run a Python plotting library on it, but every example I've seen works on an n x 2 matrix. How would one go about doing this?

Thanks in advance,

Upvotes: 0

Views: 563

Answers (2)

ace_racer

Reputation: 526

There are a few things that I would suggest (although I am not sure if SO is the right place for this question):

a. Since you mention that you are clustering unstructured text documents and are not getting good results, you might need to apply typical text-mining pre-processing steps, such as stop-word removal, punctuation removal, case-lowering and stemming, before generating the TF-IDF matrix (see the sketch below). There are other pre-processing steps, such as removing numbers and patterns, that need to be evaluated on a case-by-case basis.
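For instance (a minimal sketch using scikit-learn and NLTK's Porter stemmer; the exact steps and parameters depend on your corpus):

    # Sketch: typical pre-processing before building the TF-IDF matrix.
    # Assumes scikit-learn and NLTK are installed; parameters are examples.
    import re
    from nltk.stem import PorterStemmer
    from sklearn.feature_extraction.text import TfidfVectorizer

    stemmer = PorterStemmer()

    def preprocess(text):
        text = text.lower()                    # case-lowering
        text = re.sub(r"[^a-z\s]", " ", text)  # drop punctuation and numbers
        return " ".join(stemmer.stem(w) for w in text.split())  # stemming

    # stop_words="english" removes common English stop words after tokenizing
    vectorizer = TfidfVectorizer(preprocessor=preprocess, stop_words="english")
    tfidf = vectorizer.fit_transform(["An Example Document, with numbers: 42!"])
    print(vectorizer.get_feature_names_out())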

b. As far as visualization in 2D is concerned, you need to reduce the dimension of the feature vector to 2. Pre-processing might shrink the dimension from 2367, but not by much. You can then apply SVD to the TF-IDF matrix and check how much variance it can explain. Reducing to 2 components may, however, cause great data loss, so the visualization will not be that meaningful. But you can give it a try and see whether the results make sense.
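Something along these lines (a sketch with scikit-learn's TruncatedSVD; a random sparse matrix stands in for your TF-IDF matrix):

    # Sketch: reduce the tf-idf matrix to 2 SVD components, check the
    # explained variance, and scatter-plot the 24 documents.
    import matplotlib.pyplot as plt
    from scipy.sparse import random as sparse_random
    from sklearn.decomposition import TruncatedSVD

    tfidf = sparse_random(24, 2367, density=0.05, random_state=0)  # stand-in

    svd = TruncatedSVD(n_components=2, random_state=0)
    coords = svd.fit_transform(tfidf)           # (24, 2): one point per document
    print(svd.explained_variance_ratio_.sum())  # variance kept by 2 components

    plt.scatter(coords[:, 0], coords[:, 1])
    plt.show()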

c. If the text content of the documents is small, you can try crafting tags by hand that describe each document. These tags should number no more than 20 per document. With these new tags you can create a TF-IDF matrix and perform the SVD, which might give more interpretable results in 2D visualizations.

d. To evaluate the generated clusters, the silhouette measure can also be considered, as sketched below.
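For example (a sketch using scikit-learn with stand-in data; substitute your own matrix and the labels from your own KMeans run):

    # Sketch: silhouette score for a clustering (stand-in data; plug in
    # the real tf-idf matrix and your own cluster labels instead).
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    X, _ = make_blobs(n_samples=24, centers=3, random_state=0)
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    print(silhouette_score(X, labels))  # near 1 means well-separated clusters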

Upvotes: 2

Tomasz Gandor

Reputation: 8823

Because this is for school, there will be no full solution here, just ideas (and a few minimal sketches).

The CSV writing and reading will also be left to the reader (just a note: consider alternatives - saving/loading numpy arrays, the h5py library, and json or msgpack, for a start).
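For example, the numpy route could look like this (a minimal sketch; the file name is made up):

    # Sketch: one CSV alternative, a binary numpy array on disk.
    import numpy as np

    matrix = np.random.rand(24, 2367)   # stand-in for the dense tf-idf matrix
    np.save("tfidf.npy", matrix)        # lossless and fast
    print(np.array_equal(matrix, np.load("tfidf.npy")))  # True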

The problem for humans looking at a 24 x 2367 matrix is that it is too wide, and the numbers in it look like gibberish. But people, unlike computers, prefer images (computers don't care).

You need to map the tf-idf values to the 0-255 range and make an image. 24 x 2367 is well below a megapixel, but a 24 x 2367 image is a little too elongated. Instead, pad each row to a length that makes a nice rectangle or an approximate square (2400 or 2401 values should be fine) and generate an image for each row. You can then look at individual rows, or tile them into a 6 x 4 image of all your documents (remember some padding in between; if your pixels are gray, choose a colorful padding).
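A minimal sketch of that idea (using numpy and Pillow with a random stand-in matrix; the in-between padding is left out for brevity):

    # Sketch: scale tf-idf values to 0-255, pad each row to 2400 values,
    # reshape it to a 48 x 50 tile, and tile all 24 documents into a grid.
    import numpy as np
    from PIL import Image

    tfidf = np.random.rand(24, 2367)                       # stand-in matrix

    scaled = (255 * tfidf / tfidf.max()).astype(np.uint8)  # map to 0-255
    padded = np.pad(scaled, ((0, 0), (0, 2400 - 2367)))    # pad rows to 2400
    tiles = padded.reshape(24, 48, 50)                     # one tile per doc

    # Tile into 4 rows of 6 documents each (a 6 x 4 grid of all documents)
    grid = np.block([[tiles[r * 6 + c] for c in range(6)] for r in range(4)])
    Image.fromarray(grid).save("documents.png")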

Further ideas:

  • colormaps
  • PCA
  • t-SNE (a 2-D projection sketch follows below)
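For the t-SNE item above, a projection to 2-D might look like this (a sketch with scikit-learn on stand-in data; note that the perplexity has to stay below the number of documents):

    # Sketch: project the 24 documents to 2-D with t-SNE and scatter-plot.
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    tfidf = np.random.rand(24, 2367)  # stand-in for the tf-idf matrix
    coords = TSNE(n_components=2, perplexity=5,
                  random_state=0).fit_transform(tfidf)

    plt.scatter(coords[:, 0], coords[:, 1])
    plt.show()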

Upvotes: 1
