Reputation: 3
I need to run a hierarchical clustering algorithm in R on a dataset with 173000 rows and 17 columns. When running the function dist() on the dataset, R aborts. I have also tried it with a Windows pc and the error message I get is "cannot allocate vector of size 110.5 Gb".
My Mac and my Windows pc have 4 GB of RAM.
Is there a way to still do this in R? I know hierarchical algorithms are not the best for large datasets but it is requireed by a University assignment.
Thank you
Upvotes: 0
Views: 59
Reputation: 76585
The problem can be solved by writing a function to compute the pairwise euclidian distances between columns of the data set, assumed below to be in tabular form. For other distances, a similar function can be written.
dist2 <- function(X){
cmb <- combn(seq_len(ncol(X)), 2)
d <- matrix(NA_real_, nrow = ncol(X), ncol = ncol(X))
if(!is.null(colnames(X)))
dimnames(d) <- list(colnames(X), colnames(X))
for(i in seq_len(ncol(cmb))){
ix <- cmb[1, i]
iy <- cmb[2, i]
res <- sqrt(sum((X[, ix] - X[, iy])^2))
d[ix, iy] <- d[iy, ix] <- res
diag(d) <- 0
}
d
}
Now test the function with a data.frame of the dimensions in the question.
set.seed(2021)
m <- replicate(17, rnorm(173000))
m <- as.data.frame(m)
dist2(m)
Upvotes: 2
Reputation: 897
First and foremost, it would be very nice of you to provide a reprex (reproducible example). Make sure you will do it later.
Speaking about the issue, you can use sample_frac
function (if I am not mistaken, this is a function from tidyverse
package). For example, sample_frac(your_data, .5)
will sample 50% of your dataframe. It will reduce the size of data to be clustered and it will be easier for your laptop.
The other way is to extend the memory.limit(size = n)
where n
is a number in megabytes.
Upvotes: 0