Reputation: 108523
Say I have a data frame like this:
Df <- data.frame(
V1 = c(1,2,3,NA,5),
V2 = c(1,2,NA,4,5),
V3 = c(NA,2,NA,4,NA)
)
Now I want to count the number of valid observations for every combination of two variables. For that, I wrote a function sharedcount
:
sharedcount <- function(x,...){
nx <- names(x)
alln <- combn(nx,2)
out <- apply(alln,2,
function(y)sum(complete.cases(x[y]))
)
data.frame(t(alln),out)
}
This gives the output:
> sharedcount(Df)
X1 X2 out
1 V1 V2 3
2 V1 V3 1
3 V2 V3 2
All fine, but the function itself takes pretty long on big dataframes (600 variables and about 10000 observations). I have the feeling I'm overseeing an easier approach, especially since cor(...,use='pairwise') is running still a whole lot faster while it has to do something similar :
> require(rbenchmark)
> benchmark(sharedcount(TestDf),cor(TestDf,use='pairwise'),
+ columns=c('test','elapsed','relative'),
+ replications=1
+ )
test elapsed relative
2 cor(TestDf, use = "pairwise") 0.25 1.0
1 sharedcount(TestDf) 1.90 7.6
Any tips are appreciated.
Note : Using Vincent's trick, I wrote a function that returns the same data frame. Code in my answer below.
Upvotes: 6
Views: 10417
Reputation: 108523
Based on the lovely trick of Vincent and the additional lower.tri()
suggestion of DWin, I came up with following function that gives me the same output (i.e. a data frame) as my original one, and runs a whole lot faster :
sharedcount2 <- function(x,stringsAsFactors=FALSE,...){
counts <- crossprod(!is.na(x))
id <- lower.tri(counts)
count <- counts[id]
X1 <- colnames(counts)[col(counts)[id]]
X2 <- rownames(counts)[row(counts)[id]]
data.frame(X1,X2,count)
}
Note the use of crossprod()
, as that one gives a small improvement compared to %*%
, but it does exactly the same.
The timings :
> benchmark(sharedcount(TestDf),sharedcount2(TestDf),
+ replications=5,
+ columns=c('test','replications','elapsed','relative'))
test replications elapsed relative
1 sharedcount(TestDf) 5 10.00 90.90909
2 sharedcount2(TestDf) 5 0.11 1.00000
Note: I supplied TestDf in the question, as I noticed that the timings differ depending on the size of the data frames. As shown here, the time increase is a lot more dramatic than when compared using a small data frame.
Upvotes: 2
Reputation: 263301
I thought Vincent's looked really elegant, not to mention being faster than my sophomoric for-loop, except it seems to be needing an extraction step which I added below. This is just an example of the heavy overhead in the apply method when used with dataframes.
shrcnt <- function(Df) {Comb <- t(combn(1:ncol(Df),2) )
shrd <- 1:nrow(Comb)
for (i in seq_len(shrd)){
shrd[i] <- sum(complete.cases(Df[,Comb[i,1]], Df[,Comb[i,2]]))}
return(shrd)}
benchmark(
shrcnt(Df), sharedcount(Df), {prs <- t(x) %*% x; prs[lower.tri(prs)]},
cor(Df,use='pairwise'),
columns=c('test','elapsed','relative'),
replications=100
)
#--------------
test elapsed relative
3 { 0.008 1.0
4 cor(Df, use = "pairwise") 0.020 2.5
2 sharedcount(Df) 0.092 11.5
1 shrcnt(Df) 0.036 4.5
Upvotes: 3
Reputation: 32351
The following is slightly faster:
x <- !is.na(Df)
t(x) %*% x
# test elapsed relative
# cor(Df) 12.345 1.000000
# t(x) %*% x 20.736 1.679708
Upvotes: 10