Reputation: 465
Was messing around with the Auto dataset in R.
If I run the following:
auto = read.csv("Auto.csv", header=TRUE, na.strings="?")
summary(complete.cases(auto))
I get the following:
Mode FALSE TRUE NA's
logical 5 392 0
However, when I run this, I get different results:
auto1 = na.omit(auto)
dim(auto) # returns [1] 397 9
dim(auto1) # returns [1] 392 9
Why does complete.cases() tell me I have no NA's but na.omit() seems to be removing some entries?
Upvotes: 2
Views: 1736
Reputation: 887048
The difference is that complete.cases() returns a logical vector with one element per row of the dataset, while na.omit() removes every row that has at least one NA. Using the reproducible example defined at the end of this answer,
complete.cases(auto)
#[1] TRUE FALSE TRUE TRUE TRUE TRUE FALSE FALSE TRUE FALSE
As we can see, it is a logical vector that contains no NAs itself. It gives TRUE for rows that don't have any NAs and FALSE for rows that do. So the NA's column of summary() on this vector is always 0; the number of incomplete rows is the FALSE count.
summary(complete.cases(auto))
# Mode FALSE TRUE NA's
#logical 4 6 0
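To see where the missing values actually are, it may help to count them directly instead of reading them off summary(). The following is a minimal sketch on a made-up data frame (the name toy and its values are assumptions for illustration, not the Auto data):

```r
# Hypothetical toy data: rows 2 and 4 each contain one NA
toy <- data.frame(v1 = c(1, NA, 3, 4),
                  v2 = c("a", "b", "c", NA))

cc <- complete.cases(toy)   # logical vector, one element per row

anyNA(cc)                   # FALSE: the vector itself holds no NA
sum(!cc)                    # 2: rows with at least one NA
which(!cc)                  # 2 4: positions of the incomplete rows
colSums(is.na(toy))         # NA count per column (v1 = 1, v2 = 1)
```

Here sum(!cc) corresponds to the FALSE column of summary(complete.cases(toy)), which is the number of rows na.omit() would drop.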
To get the same rows as na.omit(), use the logical vector to subset the original dataset:
autoN <- auto[complete.cases(auto),]
auto1 <- na.omit(auto)
dim(autoN)
#[1] 6 2
dim(auto1)
#[1] 6 2
Though the resulting rows are the same, na.omit() also attaches an na.action attribute that records which rows were dropped:
str(autoN)
#'data.frame': 6 obs. of 2 variables:
# $ v1: int 1 2 2 2 3 3
# $ v2: int 3 3 3 1 4 2
str(auto1)
#'data.frame': 6 obs. of 2 variables:
# $ v1: int 1 2 2 2 3 3
# $ v2: int 3 3 3 1 4 2
# - attr(*, "na.action")=Class 'omit' Named int [1:4] 2 7 8 10
# .. ..- attr(*, "names")= chr [1:4] "2" "7" "8" "10"
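The na.action attribute can be read back with attr() to see which rows were removed. A small sketch, again on a made-up data frame (toy, kept_omit, and kept_cc are illustrative names, not part of the original example):

```r
# Hypothetical toy data: rows 2 and 4 each contain one NA
toy <- data.frame(v1 = c(1, NA, 3, 4),
                  v2 = c("a", "b", "c", NA))

kept_omit <- na.omit(toy)                 # rows without NA, plus an attribute
kept_cc   <- toy[complete.cases(toy), ]   # rows without NA, no extra attribute

# Indices of the rows that na.omit() dropped
dropped <- attr(kept_omit, "na.action")
as.integer(dropped)                       # 2 4

# After removing the attribute, compare the two results
attr(kept_omit, "na.action") <- NULL
identical(kept_omit, kept_cc)
```

If the final comparison returns TRUE, the two approaches differ only in that extra attribute.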
na.omit() is also slower than subsetting with complete.cases(), based on the benchmarks shown below.
set.seed(238)
df1 <- data.frame(v1 = sample(c(NA, 1:9), 1e7, replace=TRUE),
v2 = sample(c(NA, 1:50), 1e7, replace=TRUE))
system.time(na.omit(df1))
# user system elapsed
# 2.50 0.19 2.69
system.time(df1[complete.cases(df1),])
# user system elapsed
# 0.61 0.09 0.70
Data used in the examples above:
set.seed(24)
auto <- data.frame(v1 = sample(c(NA, 1:3), 10, replace=TRUE),
                   v2 = sample(c(NA, 1:4), 10, replace=TRUE))
Upvotes: 6