sunny

Reputation: 3891

Computing distance matrix for data table of 26000 rows and 5 columns leads to memory error

I am trying to do hierarchical clustering with a data table that isn't too large and has only numerical data.

> dim(user_info_scaled)
[1] 26497     5
> d = dist(user_info_scaled)
Error: cannot allocate vector of size 2.6 Gb

Also I have this info:

> tables()
     NAME                 NROW NCOL MB COLS                                                                             KEY
[1,] user_info         149,676   35 66 V1,run,endo.x,V1.x,_id.x,country,gender,weight,height,temperature.x,humidity.x,w                        
[2,] user_info_scaled   26,497    5  2 height,weight,ascent.x,duration.x,hour_start_per_run   

Why am I getting this error? I understand that a distance matrix is n squared, but I still don't see how that gets me to a 2.6 Gb vector. What am I missing?

Upvotes: 0

Views: 383

Answers (1)

Gavin Simpson

Reputation: 174813

A "dist" object for ~26,000 rows will take up about 2.5 Gb of RAM. Here's what it uses on my Linux box with 64 Gb of RAM:

> obj2 <- dist(matrix(rnorm(26000 * 5), ncol = 5))
> print(object.size(obj2), units = "Gb")
2.5 Gb

It's not storing the whole n-by-n matrix (which would need ~5 Gb of RAM), just the lower triangle, hence the difference. Each element is an 8-byte double, and the lower triangle holds n(n-1)/2 of them, about 338 million for the 26,000 rows in the example above.
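
Spelling out the same arithmetic for the 26,497 rows in the question:

> n <- 26497
> n * (n - 1) / 2              # elements in the lower triangle
[1] 351032256
> n * (n - 1) / 2 * 8 / 2^30   # 8 bytes per double, in Gb
[1] 2.615394

That 2.6 Gb is exactly the allocation the error message reports.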

This is just the size of the object as far as R is concerned; you may well need more RAM than that to create it. The error is saying that at some point during the dist() call R requested a further 2.6 Gb of RAM and the OS could not allocate it.
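
If the machine can't free up that much memory, one common way around it is to cluster a random subsample of the rows instead. A minimal sketch, assuming a subsample is acceptable for your analysis (the seed and the subsample size of 5,000 are arbitrary choices for illustration):

> set.seed(42)
> idx <- sample(nrow(user_info_scaled), 5000)
> d_sub <- dist(user_info_scaled[idx, ])   # ~0.09 Gb instead of 2.6 Gb
> hc <- hclust(d_sub)

The subsample size trades memory for fidelity: the dist object grows quadratically with the number of rows, so halving the rows quarters the memory.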

Upvotes: 2
