daveyjones

Reputation: 123

Adding labels to a K-means cluster plot in R

I have used some R code I found online to make a K-Means cluster plot, as follows:

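library(tm)  # provides DocumentTermMatrix() and weightTfIdf()

# docs is assumed to be a tm corpus
dtmr <- DocumentTermMatrix(docs, control = list(wordLengths = c(4, 15), bounds = list(global = c(50, 500))))
## apply tf-idf weighting
dtm_tfxidf <- weightTfIdf(dtmr)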

### k-means (this uses euclidean distance)
m <- as.matrix(dtm_tfxidf)
rownames(m) <- 1:nrow(m)

### don't forget to normalize the vectors so Euclidean makes sense
norm_eucl <- function(m) m/apply(m, MARGIN=1, FUN=function(x) sum(x^2)^.5)
m_norm <- norm_eucl(m)


### cluster into 5 clusters
cl <- kmeans(m_norm, 5)

table(cl$cluster)

### show clusters using the first 2 principal components
plot(prcomp(m_norm)$x, col=cl$cl, text(m_norm, mpg, row.names(m)))

This gives me a plot of the 5 clusters. How can I add labels to show what each dot is?

As a side note, is there any way to see what these clusters are? The table(cl$cluster) line just prints five numbers, and I do not know what they mean. My data is just over 400 text documents.

Upvotes: 1

Views: 4292

Answers (1)

Scientist_jake

Reputation: 251

I can see two problems: the text() call is inside the plot() call when it should come after it, and the x and y passed to text() are not the coordinates used to draw the plot, which come from the result of prcomp().

I'm using mtcars as a dataset:

df <- mtcars

### k-means (this uses euclidean distance)
m <- as.matrix(df)
rownames(m) <- 1:nrow(m)

### don't forget to normalize the vectors so Euclidean makes sense
norm_eucl <- function(m) m/apply(m, MARGIN=1, FUN=function(x) sum(x^2)^.5)
m_norm <- norm_eucl(m)


### cluster into 5 clusters
cl <- kmeans(m_norm, 5)

table(cl$cluster)

### show clusters using the first 2 principal components

# do the PCA outside the plot function for now
PCA <- prcomp(m_norm)$x

# plot, then add labels at the same PCA coordinates
plot(PCA, col = cl$cluster)
text(x = PCA[, 1], y = PCA[, 2], cex = 0.6, pos = 4, labels = row.names(m))
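
As a possible variant, the labels themselves can be drawn in the cluster colour instead of sitting next to the points. This is only a sketch on mtcars: it keeps the original row names (the car models) rather than replacing them with numbers, and set.seed() is an added assumption so the random k-means start is repeatable.

```r
set.seed(42)  # assumption: fixed seed so the clustering is repeatable

m <- as.matrix(mtcars)             # row names stay as the car models
m_norm <- m / sqrt(rowSums(m^2))   # scale each row to unit Euclidean length
cl <- kmeans(m_norm, 5)

pca <- prcomp(m_norm)$x

# type = "n" sets up the axes without drawing any points
plot(pca[, 1:2], type = "n")

# draw each row name at its PCA position, coloured by cluster
text(pca[, 1], pca[, 2], labels = rownames(m_norm),
     col = cl$cluster, cex = 0.6)
```

With around 400 documents the labels will probably overlap heavily, so a smaller cex or labelling only a subset of rows may be more readable.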

For the second question, the cluster assignments are in cl$cluster, a vector holding one cluster number per row of the input (per document, in your case). The table() call just counts how many members each cluster has, which is why it reports five numbers.
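
To look inside the clusters rather than just count them, something along these lines might help. It is a sketch on mtcars, not on the original documents: set.seed() and the names members and top_vars are assumptions introduced here for illustration. For a document-term matrix the columns are terms, so the same idea would list the highest-weighted terms in each cluster centre.

```r
set.seed(42)  # assumption: fixed seed so the clustering is repeatable

m <- as.matrix(mtcars)
m_norm <- m / sqrt(rowSums(m^2))   # scale each row to unit Euclidean length
cl <- kmeans(m_norm, 5)

# cl$cluster holds one cluster id (1 to 5) per row
head(cl$cluster)

# list the row names that fall into each cluster
members <- split(rownames(m_norm), cl$cluster)
members

# cl$centers has one row per cluster and one column per variable;
# the largest values hint at which variables characterise a cluster
top_vars <- apply(cl$centers, 1, function(center) {
  names(sort(center, decreasing = TRUE))[1:3]
})
top_vars
```

Here members is a list with one character vector per cluster, and top_vars has one column per cluster. The cluster numbers themselves are arbitrary labels and can change between runs unless the seed is fixed.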

Upvotes: 2