Hassan Saif

Reputation: 1062

Clustering Large Data Matrix using R

I have a large data matrix (33183 x 1681), with each row corresponding to one observation and each column to one variable.

I applied k-medoids clustering using the pam() function in R, and I tried to visualize the clustering results using the built-in plots available for pam() objects. I got this error:

Error in princomp.default(x, scores = TRUE, cor = ncol(x) != 2) :
cannot use cor=TRUE with a constant variable

I think this problem is caused by the high dimensionality of the data matrix I'm trying to cluster.

Any thoughts or ideas on how to tackle this issue?

Upvotes: 1

Views: 7498

Answers (1)

Gavin Simpson

Reputation: 174788

Check out the clara() function in the cluster package, which ships with all versions of R.

library("cluster")
## generate 500 objects, divided into 2 clusters.
x <- rbind(cbind(rnorm(200, 0, 8), rnorm(200, 0, 8)),
           cbind(rnorm(300, 50, 8), rnorm(300, 50, 8)))
clarax <- clara(x, 2, samples = 50)
clarax

> clarax
Call:    clara(x = x, k = 2, samples = 50) 
Medoids:
         [,1]       [,2]
[1,] -1.15913  0.5760027
[2,] 50.11584 50.3360426
Objective function:  10.23341
Clustering vector:   int [1:500] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...
Cluster sizes:           200 300 
Best sample:
 [1]  10  17  45  46  68  90  99 150 151 160 184 192 232 238 243 250 266 275 277
[20] 298 303 304 313 316 327 333 339 353 358 398 405 410 411 421 426 429 444 447
[39] 456 477 481 494 499 500

Available components:
 [1] "sample"     "medoids"    "i.med"      "clustering" "objective" 
 [6] "clusinfo"   "diss"       "call"       "silinfo"    "data"
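
The components listed above can be pulled out of the fitted object with `$`. The following is a minimal sketch, assuming the same simulated data as in the example; the `set.seed()` call is my addition for reproducibility, so the numbers will not match the printed output above.

```r
library("cluster")

set.seed(1)  # assumed seed, only so the simulated data are reproducible

## same simulated data as above: 500 objects in 2 well-separated clusters
x <- rbind(cbind(rnorm(200, 0, 8), rnorm(200, 0, 8)),
           cbind(rnorm(300, 50, 8), rnorm(300, 50, 8)))
clarax <- clara(x, 2, samples = 50)

## coordinates of the medoids (one row per cluster)
clarax$medoids

## row indices of the medoids in the full data set
clarax$i.med

## cluster membership for every observation, tabulated
table(clarax$clustering)

## per-cluster summary (size, max and average dissimilarity, isolation)
clarax$clusinfo
```

The `clustering` vector covers all rows of `x`, not just the best sample, so it can be bound back onto the original data for further analysis.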

Note that you should study the help for clara() (?clara), as well as the references it cites, in some detail if you want the clustering performed by clara() to be as close as possible, or identical, to that of pam().
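
As a rough sketch of what that might look like on a wide matrix: the example below uses a small made-up matrix as a stand-in for the real data, and the values of `k`, `samples` and `sampsize` are arbitrary assumptions, not recommendations. It sets `pamLike = TRUE`, which ?clara describes as using the same swap algorithm as pam() (this assumes an installed version of cluster that has the argument). Because the error in the question complains about a constant variable, the sketch also drops columns with a single distinct value first; whether that is appropriate for the real data is a judgment call.

```r
library("cluster")

set.seed(42)  # assumed seed, for reproducibility of the made-up data only

## small stand-in for the real 33183 x 1681 matrix (hypothetical data)
x <- matrix(rnorm(2000 * 10), nrow = 2000, ncol = 10)
x[, 3] <- 1   # deliberately constant column, to mimic the reported error

## flag columns that contain only one distinct value and drop them
is_const <- apply(x, 2, function(col) length(unique(col)) == 1)
x_ok <- x[, !is_const, drop = FALSE]

## clara() on the reduced matrix; a larger sampsize than the default
## brings each sub-sample closer to the full data, at higher cost
fit <- clara(x_ok, k = 3, samples = 50, sampsize = 200, pamLike = TRUE)

fit$clusinfo

## the cluster plot uses the data stored in the fit; uncomment to draw it
## clusplot(fit)
```

With the constant column removed, the principal components step behind the cluster plot no longer has a zero-variance variable to scale by, which is the condition the error message reports.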

Upvotes: 6
