Reputation: 167
I have this dataset
data
     [C1] [C2] [C3] [C4] [C5] [C6] [C7] [C8]
[1,]    5    1    2    1    4    2    1   NA
[2,]    4    1    3    4    1    1   NA    2
[3,]    3    4    6    7    1    1    2    2
[4,]    1    3   NA    1   NA    2   NA   NA
[5,]    1   NA    5   NA   NA    4    1    2
[6,]    1    4   NA   NA   NA    4    1    2
[7,]    1    4   NA   NA   NA    4    1    2
I want to add a new column C9 that takes one of two values: 1 (TRUE) if the corresponding row has the value 1 in column C2, C3, or C4, and 0 (FALSE) otherwise. I have tried this code:
C9<-data[,2:4]==1
# change the logical matrix into a numeric one
C9<-C9*1
# sum each row, which collapses the matrix into a vector
C9<-rowSums(C9)
data=cbind(data,C9)
The code works, but it takes a long time to run. Is there a more direct way to do this? I am a beginner in R.
Upvotes: 0
Views: 1100
Reputation: 59355
If I understand the question correctly, C9 must be 1 if one of C2, C3, or C4 is exactly 1, 0 otherwise. So the solution has to deal with NAs.
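To see why the NAs matter, here is a minimal sketch using the C2 to C4 values of row 5 from the question. The names `row5`, `cmp`, `plain_sum`, `safe_sum` and `flag` are made up for illustration. Comparing NA with `==` gives NA rather than FALSE, so a sum without `na.rm = TRUE` comes out as NA too.

```r
# Hypothetical vector: columns C2..C4 of row 5 in the question (NA, 5, NA)
row5 <- c(NA, 5, NA)

cmp <- row5 == 1                     # NA FALSE NA: comparing NA gives NA, not FALSE
plain_sum <- sum(cmp)                # NA, because NA propagates through sum()
safe_sum <- sum(cmp, na.rm = TRUE)   # 0, the NAs are dropped before summing
flag <- (safe_sum > 0) * 1L          # 0L: no value in this row equals 1
```

The same `na.rm = TRUE` argument is accepted by `rowSums()` and `any()`, which is presumably why the approaches below use it.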
This compares three approaches:
f.1 <- function() (rowSums(data[,2:4]==1, na.rm=TRUE)>0)*1L
f.2 <- function() {x<-rep(0L,nrow(data)); x[(data[,2]==1 | data[,3]==1 | data[,4]==1)]<-1L; x}
f.3 <- function() apply(data[,2:4], 1, function(x) any(x==1, na.rm=TRUE))*1L
library(microbenchmark)
microbenchmark(f.1(),f.2(),f.3(), times=1000)
# Unit: microseconds
# expr min lq mean median uq max neval cld
# f.1() 11.845 15.991 20.76593 18.952 22.5050 293.751 1000 a
# f.2() 10.660 14.806 44.43363 17.768 20.7290 25063.000 1000 a
# f.3() 81.137 91.797 121.80148 103.050 125.8515 2719.566 1000 b
identical(f.1(),f.2())
# [1] TRUE
identical(f.1(),f.3())
# [1] TRUE
f.1() is your approach (more or less), f.2() is a very simple and direct approach, and f.3() is from the comment. As you can see, the simple/direct approach has the lowest median time in this case, but only by a few percent over f.1(); both are roughly five times faster than the apply()-based f.3().
Why do you think this is too slow?
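For completeness, here is a sketch of the whole thing end to end, using the f.1() expression to append C9 to the matrix. The matrix is retyped from the question. The column names `C1` to `C8` without brackets are an assumption, since the printout in the question shows them as `[C1]` and so on.

```r
# Rebuild the matrix from the question (values copied row by row).
# Assumption: plain column names C1..C8.
data <- matrix(c(5,  1,  2,  1,  4, 2,  1, NA,
                 4,  1,  3,  4,  1, 1, NA,  2,
                 3,  4,  6,  7,  1, 1,  2,  2,
                 1,  3, NA,  1, NA, 2, NA, NA,
                 1, NA,  5, NA, NA, 4,  1,  2,
                 1,  4, NA, NA, NA, 4,  1,  2,
                 1,  4, NA, NA, NA, 4,  1,  2),
               nrow = 7, byrow = TRUE,
               dimnames = list(NULL, paste0("C", 1:8)))

# 1 if any of C2, C3, C4 equals 1, otherwise 0; NAs are treated as "not 1"
C9 <- (rowSums(data[, 2:4] == 1, na.rm = TRUE) > 0) * 1L
# C9 is 1 1 0 1 0 0 0

# Append the new column under the name C9
data <- cbind(data, C9 = C9)
```

Rows 1, 2 and 4 are flagged because they contain a 1 in C2 or C4; the other rows hold only other values or NA in those three columns.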
Upvotes: 1