Reputation: 6649
Suppose you have two variables that both have some missing data, but these missing data may not overlap perfectly. What is the easiest way of finding the number of common datapoints with no missing values? Is there some built-in function?
One way is to write a function like the following:
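pairwise.miss = function(x, y) {
  # coerce inputs to plain vectors
  x = as.vector(x)
  y = as.vector(y)
  # bind into a two-column object (named xy to avoid reusing the name of base::c)
  xy = cbind(x, y)
  # keep only rows without NA; drop = FALSE keeps two dimensions even if one row remains
  xy = xy[complete.cases(xy), , drop = FALSE]
  # return the number of complete pairs
  return(nrow(xy))
}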
Another idea is to use some function that returns the pairwise complete data. For instance, rcorr() from Hmisc does this, but may give errors for non-numeric data. So:
rcorr(x, y)$n[1,2]
Is there an easier way?
Upvotes: 1
Views: 122
Reputation: 6649
I benchmarked the solutions given in the question and the other answers:
if (!require("pacman")) install.packages("pacman")
pacman::p_load(microbenchmark)
#fetch some data
x = iris[1] #from iris
y = iris[1]
x[sample(1:150, 50), ] = NA #random subset
y[sample(1:150, 50), ] = NA
#benchmark
times = microbenchmark(pairwise.function = pairwise.miss(x, y),
sum.is.na = sum(!is.na(x) & !is.na(y)),
sum.is.na2 = sum(!(is.na(x) | is.na(y))),
sum.complete.cases = sum(complete.cases(x, y)))
times
Results:
> times
Unit: microseconds
expr min lq mean median uq max neval
pairwise.function 202.205 217.2935 244.31481 233.3150 253.8460 450.763 100
sum.is.na 75.594 78.5500 89.26383 80.5730 94.1035 248.558 100
sum.is.na2 74.662 77.6170 89.23899 80.5725 94.8825 167.676 100
sum.complete.cases 14.311 16.1770 18.77197 17.1105 17.7330 155.233 100
So my original method was much slower than the sum.complete.cases one: roughly 14 times slower by median (about 233 vs. 17 microseconds).
Perhaps there is rarely a need for speed in this computation, but one might as well use the most efficient method when it is equally easy to use.
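Before comparing timings, it may be worth confirming that the candidates actually agree on the count. The following is only a sketch: it uses plain vectors taken from iris[[1]] rather than the one-column data frames above, the seed is arbitrary, and the cbind.rows entry is an inlined stand-in for the function from the question.

```r
set.seed(1)  # arbitrary seed, only so the check is reproducible

# plain numeric vectors with 50 random NAs each
x <- iris[[1]]
y <- iris[[1]]
x[sample(150, 50)] <- NA
y[sample(150, 50)] <- NA

xy <- cbind(x, y)

# count the pairwise complete observations with each approach
counts <- c(
  cbind.rows         = nrow(xy[complete.cases(xy), , drop = FALSE]),
  sum.is.na          = sum(!is.na(x) & !is.na(y)),
  sum.is.na2         = sum(!(is.na(x) | is.na(y))),
  sum.complete.cases = sum(complete.cases(x, y))
)

# all approaches should report the same number
stopifnot(length(unique(counts)) == 1)
counts
```

If the stopifnot() call passes, the methods differ only in speed and not in the result.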
Upvotes: 1
Reputation: 5586
You can simply list the two variables in complete.cases() and sum() the output.
x <- c(1, 2, 3, NA, NA, NA, 5)
y <- c(1, NA, 3, NA, 3, 2, NA)
complete.cases(x, y)
#[1] TRUE FALSE TRUE FALSE FALSE FALSE FALSE
sum(complete.cases(x, y))
#[1] 2
The sum of a logical vector is the number of TRUE elements since TRUE is coerced to 1 and FALSE to 0.
This works for any data type. However, note that empty strings, i.e. "", are not considered missing. An actual missing character value is denoted by NA_character_.
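To illustrate the point about empty strings, here is a small sketch with made-up character vectors (a and b are hypothetical names, not from the question). It shows one possible way to recode "" as missing before counting, if that is the behaviour you want.

```r
# hypothetical character vectors for illustration
a <- c("x", "", NA_character_, "z")
b <- c("p", "q", "r", NA_character_)

# "" counts as present, so only positions 3 and 4 are incomplete
n_raw <- sum(complete.cases(a, b))

# recode empty strings as NA first if they should count as missing;
# %in% returns FALSE for NA, so existing NAs do not disturb the index
a_clean <- a
a_clean[a_clean %in% ""] <- NA_character_
n_clean <- sum(complete.cases(a_clean, b))

n_raw    # 2
n_clean  # 1
```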
Upvotes: 3
Reputation: 1281
A possible solution is to use is.na and logical operators:
!(is.na(x) | is.na(y)) # logical vector
which(!(is.na(x) | is.na(y))) # integer vector of indices.
If you want only the total count, use:
sum(!(is.na(x) | is.na(y)))
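As a short sketch of how the logical vector and the indices can be used beyond counting, the example below reuses the x and y values from the complete.cases() answer; keep and idx are illustrative names.

```r
x <- c(1, 2, 3, NA, NA, NA, 5)
y <- c(1, NA, 3, NA, 3, 2, NA)

# TRUE where both x and y are observed
keep <- !(is.na(x) | is.na(y))

# positions of the pairwise complete observations
idx <- which(keep)

# the paired values themselves
x_obs <- x[keep]
y_obs <- y[keep]

# for plain vectors this matches complete.cases()
stopifnot(identical(keep, complete.cases(x, y)))

idx  # 1 3
```

The logical vector is handy for subsetting both variables consistently, while sum(keep) gives the same count as the one-liner above.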
Upvotes: 1