Reputation: 77
I have a fairly large sparse matrix (40,000 x 100,000+), and I want to replace an element with 1 if it is greater than some threshold. However, each row has its own threshold (the thresholds are stored in a vector with one entry per row), so I want to go row by row and check whether each element of a row is greater than that row's threshold.
I originally attempted this with a for loop over all the non-zero elements of the sparse matrix, but this took far too long because there are more than 100 million elements to go through.
number_of_elem <- length(matrix@x)
for (j in seq_len(number_of_elem)) {
  # thres_array has one entry per row, so look it up via the zero-based row index
  threshold <- thres_array[matrix@i[j] + 1]
  if (threshold == 0) {
    next
  }
  if (matrix@x[j] > threshold) {
    matrix@x[j] <- 1
  }
}
I then tried the apply function, but I could not figure out how to skip a row whose threshold is zero. For reference, I first calculated the quantiles of each row and set each row's threshold at its 95th percentile. Since the matrix is sparse, some of the threshold values are zero.
Any ideas on how to approach this? From what I know, in R it is strongly preferred to vectorize code and avoid for loops, but I could not think of a workable method.
Upvotes: 0
Views: 1020
Reputation: 77
I modified @Bas's solution so that it exploits the sparsity of the matrix, which improves performance.
mat@x[mat@x > thres_array[mat@i + 1] ] <- 1
mat@x gives the non-zero elements of the sparse matrix, and mat@i gives the row each non-zero element belongs to (you have to add 1 because it is zero-indexed). Since the elements of thres_array correspond to rows, you can build a logical vector from mat@x > thres_array[mat@i + 1] and reassign the selected values to 1.
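To make the indexing concrete, here is a minimal sketch on a tiny made-up matrix. It assumes the Matrix package and a column-compressed matrix (the default returned by sparseMatrix()), and it borrows the Inf replacement from @Bas's answer so that rows with a zero threshold are skipped. The values are purely illustrative.

```r
library(Matrix)

# Small made-up example: 3 rows, 4 columns (values are illustrative only)
mat <- sparseMatrix(
  i = c(1, 1, 2, 3, 3),
  j = c(1, 3, 2, 1, 4),
  x = c(5, 0.5, 7, 2, 9),
  dims = c(3, 4)
)

# One threshold per row; row 2 has a zero threshold and should be skipped
thres_array <- c(1, 0, 3)
thres_array[thres_array == 0] <- Inf

# mat@i holds the zero-based row index of every stored entry,
# so thres_array[mat@i + 1] is the threshold matching each entry of mat@x
hit <- mat@x > thres_array[mat@i + 1]
mat@x[hit] <- 1

as.matrix(mat)
```

Only the stored entries are touched, so the work scales with the number of non-zero elements rather than with the full dimensions of the matrix.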
Upvotes: 2
Reputation: 4658
You are right that in R it is often preferred to vectorize your code. Fortunately, if I understood your question correctly, this is easy to do in this case.
Since you have not provided any data (please do so in the future), I generated a threshold array thres_array and a matrix mat below.
Comparing each entry of thres_array with an entire row of mat is then a matter of mat > thres_array, because R recycles the vector down each column when it has one entry per row. Applying the threshold can also be done in one line.
By replacing zeros in thres_array with Inf, we make sure that mat > thres_array is never true for those rows, hence skipping those values.
thres_array <- 0:9
mat <- matrix(runif(1000, max = 10), nrow = 10)
# get rid of zeros
thres_array[thres_array == 0] <- Inf
# apply threshold
mat[mat > thres_array] <- 1
For my randomly generated matrix mat, this gives the output below.
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
[1,] 8.80034895 8.422070 4.9415068 5.0451436 2.038524 0.1091817 7.900194 4.22983010 1.318235 3.9218194 7.491424 1.414268 8.9569142 3.347458
[2,] 1.00000000 1.000000 1.0000000 1.0000000 0.654243 1.0000000 1.000000 1.00000000 1.000000 1.0000000 1.000000 1.000000 1.0000000 1.000000
[3,] 1.00000000 1.000000 1.2302859 1.0000000 1.000000 0.9299740 1.000000 1.00000000 1.661907 1.0000000 1.000000 1.293784 1.0000000 1.987043
[4,] 1.01573038 1.566547 1.0000000 1.0000000 2.469330 1.0000000 0.609428 2.04922439 1.000000 1.0000000 1.000000 1.000000 1.0000000 1.000000
[5,] 1.00000000 1.000000 0.2595911 1.0000000 1.000000 3.0623223 1.000000 1.00000000 3.333816 0.7444644 1.000000 1.253450 2.6955623 1.000000
[6,] 3.66609571 1.000000 2.0263511 2.5939923 1.000000 1.0000000 1.536697 0.41910933 3.586519 1.0000000 1.000000 4.921295 1.7967002 1.000000
[7,] 1.00000000 1.000000 ...
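As a small check of why the one-line comparison lines thresholds up with rows, the sketch below compares mat > thres_array against an explicitly expanded threshold matrix. The values and the helper name recycled are made up for illustration, and the sketch assumes thres_array has exactly one entry per row.

```r
# Illustrative values only: 3 rows, 2 columns (filled column by column)
mat <- matrix(c(5, 0.5, 4,
                2, 7,   2), nrow = 3)
thres_array <- c(1, 0, 3)

# A vector of length nrow(mat) is recycled down each column,
# so entry i of thres_array is compared with every element of row i
recycled <- matrix(thres_array, nrow = nrow(mat), ncol = ncol(mat))
same <- identical(mat > thres_array, mat > recycled)
same

# Skip zero thresholds, then apply the per-row threshold
thres_array[thres_array == 0] <- Inf
mat[mat > thres_array] <- 1
mat
```

If the vector's length differed from the number of rows, the recycling would no longer align with rows, so it is worth checking length(thres_array) == nrow(mat) first.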
Upvotes: 0