CoderGuy123

Reputation: 6649

What is the easiest way to find the pairwise complete data for two variables?

Suppose you have two variables that both have some missing data, but these missing data may not overlap perfectly. What is the easiest way of finding the number of common datapoints with no missing values? Is there some built-in function?

One way is to write a function like the following:

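pairwise.miss = function(x, y) {
  # deal with input types
  x = as.vector(x)
  y = as.vector(y)
  # make combined object (named xy rather than c to avoid confusion with base::c)
  xy = cbind(x, y)
  # remove NA rows; drop = FALSE keeps two dimensions even if only one row remains
  xy = xy[complete.cases(xy), , drop = FALSE]
  # return the number of complete pairs
  return(nrow(xy))
}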

Another idea is to use some function that returns the pairwise complete data. For instance, rcorr() from Hmisc does this, but may give errors for non-numeric data. So:

rcorr(x, y)$n[1,2]

Is there an easier way?

Upvotes: 1

Views: 122

Answers (3)

CoderGuy123

Reputation: 6649

I benchmarked my original function against the solutions given in the other answers:

if (!require("pacman")) install.packages("pacman")
pacman::p_load(microbenchmark)

#fetch some data
x = iris[1] #first column of iris, as a one-column data frame
y = iris[1]
x[sample(1:150, 50), ] = NA #random subset
y[sample(1:150, 50), ] = NA

#benchmark
times = microbenchmark(pairwise.function = pairwise.miss(x, y),
                       sum.is.na = sum(!is.na(x) & !is.na(y)),
                       sum.is.na2 = sum(!(is.na(x) | is.na(y))),
                       sum.complete.cases = sum(complete.cases(x, y)));times

Results:

> times
Unit: microseconds
               expr     min       lq      mean   median       uq     max neval
  pairwise.function 202.205 217.2935 244.31481 233.3150 253.8460 450.763   100
          sum.is.na  75.594  78.5500  89.26383  80.5730  94.1035 248.558   100
         sum.is.na2  74.662  77.6170  89.23899  80.5725  94.8825 167.676   100
 sum.complete.cases  14.311  16.1770  18.77197  17.1105  17.7330 155.233   100

So my original method was far slower than the sum.complete.cases one: about 14 times slower at the median (233 vs. 17 microseconds).

Perhaps there is rarely a need for speed in this computation, but one might as well use the most efficient method when it is equally easy to use.
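If the count is needed for every pair among several variables at once, similar to the n matrix that rcorr() returns in the question, one possible base R sketch is to take the cross product of the "is present" indicator matrix. The matrix m below is made up for illustration, and this approach was not part of the benchmark above, so no claim is made here about its speed.

```r
# Hypothetical data: two numeric columns with partly overlapping NAs
m <- cbind(x = c(1, 2, 3, NA, NA, NA, 5),
           y = c(1, NA, 3, NA, 3, 2, NA))

# TRUE where a value is present; multiplying by 1 gives a 0/1 numeric matrix
present <- !is.na(m)

# Entry [i, j] counts the rows where both column i and column j are present
n_pairs <- crossprod(present * 1)
n_pairs
```

The diagonal holds the number of non-missing values per variable, and the off-diagonal entries hold the pairwise complete counts.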

Upvotes: 1

Alex A.

Reputation: 5586

You can simply list the two variables in complete.cases() and sum() the output.

x <- c(1, 2, 3, NA, NA, NA, 5)
y <- c(1, NA, 3, NA, 3, 2, NA)

complete.cases(x, y)
#[1] TRUE FALSE TRUE FALSE FALSE FALSE FALSE

sum(complete.cases(x, y))
#[1] 2

The sum of a logical vector is the number of TRUE elements since TRUE is coerced to 1 and FALSE to 0.

This works for any data type. However, note that empty strings, i.e. "", are not considered missing. An actual missing character value is denoted by NA_character_.
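A minimal sketch of the character case, using made-up vectors a and b (not from the question). It shows that "" counts as present, and one way to recode empty strings to NA first if they should be treated as missing:

```r
# Hypothetical character vectors, for illustration only
a <- c("x", "", NA_character_, "z")
b <- c("p", "q", "r", NA_character_)

# "" is an ordinary string, so position 2 still counts as complete
n_default <- sum(complete.cases(a, b))

# Treat empty strings as missing by recoding them to NA first
a_clean <- a
a_clean[!is.na(a_clean) & a_clean == ""] <- NA_character_
n_strict <- sum(complete.cases(a_clean, b))

n_default  # 2
n_strict   # 1
```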

Upvotes: 3

B.Shankar

Reputation: 1281

A possible solution is to use is.na and logical operators:

!(is.na(x) | is.na(y))        # logical vector

which(!(is.na(x) | is.na(y))) # integer vector of indices.

If you want only the total count, use:

sum(!(is.na(x) | is.na(y)))
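As a small worked example, with x and y chosen here purely for illustration, the same logical mask gives the count, the indices, and the paired values themselves, and it agrees with complete.cases():

```r
x <- c(1, 2, 3, NA, NA, NA, 5)
y <- c(1, NA, 3, NA, 3, 2, NA)

ok  <- !(is.na(x) | is.na(y))  # TRUE where both values are present
idx <- which(ok)               # positions of the complete pairs: 1 and 3
n   <- sum(ok)                 # number of complete pairs: 2

# The same mask extracts the complete pairs as a two-column matrix
pairs <- cbind(x = x[ok], y = y[ok])

# Cross-check: the mask matches what complete.cases() returns
same <- identical(ok, complete.cases(x, y))
```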

Upvotes: 1
