fp1234

Reputation: 77

How to replace row values based on a threshold of a sparse matrix in R?

I have a fairly large sparse matrix (40,000 x 100,000+) and I want to replace an element with 1 if it is greater than some threshold. However, each row of the matrix has its own threshold value (stored in a vector with one entry per row), so I want to go row by row and check whether each element of a particular row is greater than that row's threshold.

I originally attempted this with a for loop over all the non-zero elements of the sparse matrix, but this took far too long since there are over 100 million of them.

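# matrix@x holds the non-zero values, matrix@i their zero-based row indices
number_of_elem <- length(matrix@x)
for (j in seq_len(number_of_elem)) {

  # look up the threshold of the row this element belongs to
  threshold <- thres_array[matrix@i[j] + 1]

  # a zero threshold means the row is skipped
  if (threshold == 0) {
    next
  }

  if (matrix@x[j] > threshold) {
    matrix@x[j] <- 1
  }

}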

I then tried the apply function, but I could not figure out how to skip a row when its threshold is zero. For reference, I first calculated a quantile for each row, so an element exceeds its threshold when it is above that row's 95th percentile. Since the matrix is sparse, some of the threshold values are zero.

Any ideas on how to approach this? From what I know, in R it is strongly preferred to vectorize code and avoid for loops, but I could not think of a suitable method.

Upvotes: 0

Views: 1020

Answers (2)

fp1234

Reputation: 77

I modified @Bas's solution so that it exploits the sparsity of the matrix, which improves performance.

mat@x[mat@x > thres_array[mat@i + 1] ] <- 1

mat@x gives the non-zero elements of the sparse matrix and mat@i gives the row each of those elements belongs to (you have to add 1 since mat@i is zero-indexed). Since the elements of thres_array correspond to rows, mat@x > thres_array[mat@i + 1] is a logical vector marking the elements that exceed their row's threshold, and those elements are reassigned to 1.
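
As a minimal sketch of the idea on a toy example: the 3 x 4 matrix and the thresholds below are made up for illustration, and the Matrix package is assumed. Note that the one-liner above does not by itself skip rows whose threshold is 0 (any positive stored value is greater than 0), so this sketch assumes the zeros are first mapped to Inf, as in @Bas's answer.

```r
library(Matrix)

# Hypothetical 3 x 4 example data (not from the original question)
mat <- sparseMatrix(
  i = c(1, 1, 2, 3, 3),
  j = c(1, 3, 2, 1, 4),
  x = c(5, 2, 7, 4, 9),
  dims = c(3, 4)
)
thres_array <- c(3, 0, 6)  # one threshold per row; 0 means "skip this row"

# Assumption: a zero threshold should never be exceeded, so map it to Inf
thres <- thres_array
thres[thres == 0] <- Inf

# mat@i holds the zero-based row index of each stored value in mat@x
hit <- mat@x > thres[mat@i + 1]
mat@x[hit] <- 1

as.matrix(mat)
```

Here the 5 in row 1 and the 9 in row 3 exceed their row thresholds and become 1, while row 2 (threshold 0) is left untouched. Only the stored non-zero values are ever compared, which is where the speed-up over a dense comparison comes from.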

Upvotes: 2

Bas

Reputation: 4658

You are right that in R it is often preferred to vectorize your code. Fortunately, if I understood your question correctly, this is easy to do in this case.

Since you have not provided any data (please do so in the future), I generated a threshold array thres_array and a matrix mat below.
Because R stores matrices column by column and recycles the shorter vector, mat > thres_array compares every element of row i with thres_array[i], provided thres_array has one entry per row. Applying the threshold can then also be done in one line.
By replacing zeros in thres_array with Inf, we make sure that mat > thres_array is never true for those rows, hence skipping them.

thres_array <- 0:9
mat <- matrix(runif(1000, max = 10), nrow = 10)

# get rid of zeros
thres_array[thres_array == 0] <- Inf

# apply threshold
mat[mat > thres_array] <- 1

For my randomly generated matrix mat, this gives the following (output truncated).

           [,1]     [,2]      [,3]      [,4]     [,5]      [,6]     [,7]       [,8]     [,9]     [,10]    [,11]    [,12]     [,13]    [,14]
 [1,] 8.80034895 8.422070 4.9415068 5.0451436 2.038524 0.1091817 7.900194 4.22983010 1.318235 3.9218194 7.491424 1.414268 8.9569142 3.347458
 [2,] 1.00000000 1.000000 1.0000000 1.0000000 0.654243 1.0000000 1.000000 1.00000000 1.000000 1.0000000 1.000000 1.000000 1.0000000 1.000000
 [3,] 1.00000000 1.000000 1.2302859 1.0000000 1.000000 0.9299740 1.000000 1.00000000 1.661907 1.0000000 1.000000 1.293784 1.0000000 1.987043
 [4,] 1.01573038 1.566547 1.0000000 1.0000000 2.469330 1.0000000 0.609428 2.04922439 1.000000 1.0000000 1.000000 1.000000 1.0000000 1.000000
 [5,] 1.00000000 1.000000 0.2595911 1.0000000 1.000000 3.0623223 1.000000 1.00000000 3.333816 0.7444644 1.000000 1.253450 2.6955623 1.000000
 [6,] 3.66609571 1.000000 2.0263511 2.5939923 1.000000 1.0000000 1.536697 0.41910933 3.586519 1.0000000 1.000000 4.921295 1.7967002 1.000000
 [7,] 1.00000000 1.000000 ...
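
To make the row-wise comparison easier to check by hand, here is a small deterministic sketch; the matrix m and the vector thres are hypothetical and chosen only for illustration. It relies on the comparison vector having exactly one entry per row.

```r
# Hypothetical 3 x 2 matrix, filled column by column
m <- matrix(c(1, 5, 9,
              4, 2, 8), nrow = 3)
thres <- c(3, 3, Inf)  # one threshold per row; Inf means "skip this row"

# thres is recycled down each column, so m[i, j] is compared with thres[i]
exceeds <- m > thres

# replace every element that exceeds its row threshold
m[exceeds] <- 1
m
```

Only the 5 in row 2 and the 4 in row 1 exceed their thresholds and are replaced; row 3 is untouched because nothing is greater than Inf. If thres had a length other than nrow(m), the recycling would no longer line up with the rows.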

Upvotes: 0
