Kevin Zakka

Reputation: 465

Discrepancy between complete.cases() and na.omit()

Was messing around with the Auto dataset in R.

If I run the following:

auto = read.csv("Auto.csv", header=TRUE, na.strings="?")
summary(complete.cases(auto))

I get the following:

   Mode   FALSE    TRUE    NA's 
logical       5     392       0

However, when I run this, I get different results:

auto1 = na.omit(auto)
dim(auto)  # returns [1] 397   9
dim(auto1) # returns [1] 392   9

Why does complete.cases() tell me I have no NA's but na.omit() seems to be removing some entries?

Upvotes: 2

Views: 1736

Answers (1)

akrun

Reputation: 887048

The difference is that complete.cases returns a logical vector with one element per row of the dataset, while na.omit removes the rows that have at least one NA. Using the reproducible example created in the data section below,

complete.cases(auto)
#[1]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE

As we can see, it is a logical vector with no NAs: it gives TRUE for rows that don't have any NAs and FALSE for rows that have at least one. So summary on this logical vector reports 0 under NA's, and the incomplete rows are counted under FALSE instead.

summary(complete.cases(auto))
#  Mode   FALSE    TRUE    NA's 
#logical       4       6       0 
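To count the missing values themselves, rather than summarise the logical vector, one option is to negate complete.cases() or tally is.na() per column. A minimal sketch on a small hand-made data frame (toy is a hypothetical name used only for illustration, not the Auto data):

```r
# Hypothetical toy data frame with known missing values
toy <- data.frame(v1 = c(1, NA, 3, 4, NA),
                  v2 = c(2, 5, NA, 1, NA))

ok <- complete.cases(toy)          # logical vector, one element per row
has_na <- anyNA(ok)                # the vector itself contains no NA

n_incomplete <- sum(!ok)           # number of rows with at least one NA
na_per_col <- colSums(is.na(toy))  # number of NAs in each column

n_incomplete
na_per_col
```

Here rows 2, 3 and 5 each contain a missing value, so n_incomplete is 3, which is the figure summary() would report under FALSE.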

To get the same result as na.omit, use the logical vector to subset the original dataset:

autoN <- auto[complete.cases(auto),]
auto1 <- na.omit(auto)
dim(autoN)
#[1] 6 2
dim(auto1)
#[1] 6 2

Although both results contain the same rows, na.omit also attaches an "na.action" attribute that records which rows were dropped:

str(autoN)
#'data.frame':   6 obs. of  2 variables:
# $ v1: int  1 2 2 2 3 3
# $ v2: int  3 3 3 1 4 2
str(auto1)
#'data.frame':   6 obs. of  2 variables:
# $ v1: int  1 2 2 2 3 3
# $ v2: int  3 3 3 1 4 2
# - attr(*, "na.action")=Class 'omit'  Named int [1:4] 2 7 8 10
#  .. ..- attr(*, "names")= chr [1:4] "2" "7" "8" "10"
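The "na.action" attribute can be read back with attr(), and it should list exactly the rows that complete.cases() marks as FALSE. A small sketch under the same assumption of a hypothetical toy data frame (the names toy, by_subset and by_omit are illustrative only):

```r
# Hypothetical toy data frame with known missing values
toy <- data.frame(v1 = c(1, NA, 3, 4, NA),
                  v2 = c(2, 5, NA, 1, NA))

by_subset <- toy[complete.cases(toy), ]
by_omit   <- na.omit(toy)

# Row positions that na.omit removed, stripped of names and class
dropped <- as.integer(attr(by_omit, "na.action"))

# They match the rows flagged as incomplete by complete.cases
same_rows <- identical(dropped, which(!complete.cases(toy)))

# Ignoring the extra attribute, the two data frames hold the same data
same_data <- isTRUE(all.equal(by_subset, by_omit, check.attributes = FALSE))
```

If the extra attribute is unwanted, setting attr(by_omit, "na.action") <- NULL removes it.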

na.omit is also slower than subsetting with complete.cases, based on the benchmarks shown below.

Benchmarks

set.seed(238)
df1 <- data.frame(v1 = sample(c(NA, 1:9), 1e7, replace=TRUE),
              v2 = sample(c(NA, 1:50), 1e7, replace=TRUE))
system.time(na.omit(df1))
#  user  system elapsed 
#   2.50    0.19    2.69 
system.time(df1[complete.cases(df1),])
#  user  system elapsed 
#  0.61    0.09    0.70 

data

set.seed(24)
auto <- data.frame(v1 = sample(c(NA, 1:3), 10, replace=TRUE), 
                   v2 = sample(c(NA, 1:4), 10, replace=TRUE))

Upvotes: 6
