R aborts when using function DIST (110 GB vector)

I need to run a hierarchical clustering algorithm in R on a dataset with 173000 rows and 17 columns. When I run the function dist() on the dataset, R aborts. I have also tried it on a Windows PC, where the error message is "cannot allocate vector of size 110.5 Gb".

Both my Mac and my Windows PC have 4 GB of RAM.

Is there a way to still do this in R? I know hierarchical algorithms are not the best for large datasets, but it is required by a university assignment.

Thank you

Upvotes: 0

Answers (2)

Rui Barradas

Reputation: 76585

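The problem can be solved by writing a function to compute the pairwise Euclidean distances between columns of the data set, assumed below to be in tabular form. Note that dist() itself works on rows, so this is the column-wise counterpart and the result for 17 columns is a small 17 x 17 matrix. For other distances, a similar function can be written.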

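dist2 <- function(X){
  # all unordered pairs of column indices, one pair per column of cmb
  cmb <- combn(seq_len(ncol(X)), 2)
  d <- matrix(NA_real_, nrow = ncol(X), ncol = ncol(X))
  if(!is.null(colnames(X)))
    dimnames(d) <- list(colnames(X), colnames(X))
  # the distance from a column to itself is zero; set it once, outside the loop
  diag(d) <- 0
  
  for(i in seq_len(ncol(cmb))){
    ix <- cmb[1, i]
    iy <- cmb[2, i]
    # Euclidean distance between columns ix and iy
    res <- sqrt(sum((X[, ix] - X[, iy])^2))
    # the matrix is symmetric, so fill both triangles
    d[ix, iy] <- d[iy, ix] <- res
  }
  
  d
}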

Now test the function with a data.frame of the dimensions in the question.

set.seed(2021)
m <- replicate(17, rnorm(173000))
m <- as.data.frame(m)

dist2(m)
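
As a sanity check on what is being computed, here is a minimal sketch on a small made-up data frame (the object names `X`, `d_cols` and `manual` are only for illustration). It assumes that base R's dist() applied to the transpose gives the same column-to-column Euclidean distances, and compares one entry with the formula used in the loop above.

```r
set.seed(1)
# Small illustrative data: 10 rows, 4 columns (not the data from the question)
X <- as.data.frame(replicate(4, rnorm(10)))

# dist() works on rows, so transposing makes it compare the columns of X
d_cols <- as.matrix(dist(t(X)))

# Same formula as in the loop: Euclidean distance between columns 1 and 2
manual <- sqrt(sum((X[, 1] - X[, 2])^2))

# Both approaches should agree up to floating point tolerance
stopifnot(isTRUE(all.equal(d_cols[1, 2], manual)))

# One row and one column per column of X
dim(d_cols)
```

The result has one row and one column per variable, so it describes how far apart the variables are, not the observations.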

Upvotes: 2

rg4s

Reputation: 897

First and foremost, it would help a lot if you provided a reprex (reproducible example). Please add one next time.

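As for the issue itself, you can use the sample_frac() function from the dplyr package (part of the tidyverse). For example, sample_frac(your_data, .5) will sample 50% of the rows of your data frame. This reduces the amount of data to be clustered and makes the job easier for your laptop. Keep in mind that the distance matrix grows with the square of the number of rows, so with 4 GB of RAM you will probably need a much smaller fraction than 50%.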
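
A minimal sketch of the sampling idea, written in base R so it runs without extra packages (sample() on the row indices is similar in spirit to sample_frac()). The data here is randomly generated as a stand-in, and the 10% fraction, the "average" linkage and k = 3 are arbitrary choices for illustration, not recommendations.

```r
set.seed(42)
# Hypothetical stand-in for the real data: 20000 rows, 17 numeric columns
your_data <- as.data.frame(replicate(17, rnorm(20000)))

# Keep 10% of the rows, chosen at random without replacement
frac <- 0.1
keep <- sample(nrow(your_data), size = round(frac * nrow(your_data)))
smp <- your_data[keep, ]

# 2000 rows need 2000 * 1999 / 2 distances, roughly 16 MB of doubles
d <- dist(smp)
hc <- hclust(d, method = "average")

# Cut the tree into k groups; k = 3 is an arbitrary illustrative choice
groups <- cutree(hc, k = 3)
table(groups)
```

The sample size is the knob to turn: halving the number of rows cuts the memory needed for the distances to about a quarter.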

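The other way is to raise the limit with memory.limit(size = n), where n is a number in megabytes. Note that this function is Windows-only (and, as far as I know, no longer supported from R 4.2 on), and raising the limit cannot make a 110 GB vector fit on a machine with 4 GB of RAM.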
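
To see where a number of that size comes from, here is a rough back-of-the-envelope estimate. It assumes dist() stores one 8-byte double for every pair of rows (the lower triangle only) and uses the row count quoted in the question, which may be rounded.

```r
# Rows, as stated in the question (possibly a rounded figure)
n <- 173000

# dist() keeps one value per unordered pair of rows
n_pairs <- n * (n - 1) / 2

# Each value is stored as an 8-byte double
bytes <- n_pairs * 8

# Convert to GiB (2^30 bytes), the unit behind R's "Gb" in the error message
gib <- bytes / 2^30
gib
```

This comes out at a bit over 110 GiB, the same order of magnitude as the 110.5 Gb in the error message and far beyond what 4 GB of RAM can hold.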

Upvotes: 0
