Weird NA behaviour in R

Question

I've noticed some strange behaviour when merging dataframes in R.

When merging, I get extra NA rows in the merged dataframe. The don't appear if displaying the dataframe and don't affect the output of length(), but become apparent when indexing. There is a workaround, but it isn't particularly satisfying. I'd much rather not have the strange behaviour at all.

This happens using join() in the plyr package and may also happen when joining tables in sql queries with RODBC. It may happen with other packages too.

Can anyone explain what is going on here and how to avoid it? I'm using RStudio in winXP and win8, R version 3.0.1.

> library(plyr)
> # example adapted from http://www.codinghorror.com/blog/2007/10/a-visual-explanation-of-sql-joins.html
> 
> a <- as.data.frame(cbind(seq(1, 4, 1), c("Pirate", "Monkey", "Ninja", "Spaghetti")), stringsAsFactors = FALSE)
> names(a) <- c("a.id", "name")
> a
  a.id      name
1    1    Pirate
2    2    Monkey
3    3     Ninja
4    4 Spaghetti
> 
> b <- as.data.frame(cbind(seq(1, 5, 1), c("Pirate", "Rutabaga", "Darth Vader", "Ninja", "Ninja")), stringsAsFactors = FALSE)
> names(b) <- c("b.id", "name")
> b
  b.id        name
1    1      Pirate
2    2    Rutabaga
3    3 Darth Vader
4    4       Ninja
5    5       Ninja
> 
> z <- join(a, b, by = "name", type = "left", match = "all")
> z
  a.id      name b.id
1    1    Pirate    1
2    2    Monkey 
3    3     Ninja    4
4    3     Ninja    5
5    4 Spaghetti 
> z$compare <- ifelse(z$a.id==z$b.id, 1, 0)
> z # shows dataframe of 5 rows
  a.id      name b.id compare
1    1    Pirate    1       1
2    2    Monkey       NA
3    3     Ninja    4       0
4    3     Ninja    5       0
5    4 Spaghetti       NA
> z[z$compare==0 ,] # shows extra rows with row names of NA and NA.1
     a.id  name b.id compare
NA            NA
3       3 Ninja    4       0
4       3 Ninja    5       0
NA.1          NA
> 
> z <- z[!is.na(z$a.id),]
> z[z$compare==0 ,] # NA rows retained
     a.id  name b.id compare
NA            NA
3       3 Ninja    4       0
4       3 Ninja    5       0
NA.1          NA

# work around to produce expected output
> z[z$compare==0 & !is.na(z$compare) ,]
  a.id  name b.id compare
3    3 Ninja    4       0
4    3 Ninja    5       0

Sven Hohenstein · Accepted Answer

If you want to extract those rows where compare is equal to 0, you have to exclude the NA cases.

Have a look at the column compare:

z$compare
[1]  1 NA  0  0 NA

When you use comparison, i.e., ==, NAs will result in NAs. @Codoremifa also showed this in his answer.

z$compare == 0
# [1] FALSE    NA  TRUE  TRUE    NA

You should do both (a) compare compare with 0 and exclude NAs with !is.na():

z$compare == 0 & !is.na(z$compare)
# [1] FALSE FALSE  TRUE  TRUE FALSE

This command returns TRUE and FALSE only.

This output can be used for subsetting:

z[z$compare == 0 & !is.na(z$compare), ]
#   a.id  name b.id compare
# 3    3 Ninja    4       0
# 4    3 Ninja    5       0

On the behavior of NA used for indexing rows of data frames.

All values will be replaced by NA if you use NA for indexing:

z[NA, ]
#      a.id name b.id compare
# NA           NA
# NA.1         NA
# NA.2         NA
# NA.3         NA
# NA.4         NA

(This behavior is due to vector recycling. The command is the same as z[rep(NA, nrow(z)), ].)

If you use an index vector including NAs, the corresponding rows will also contain NAs only.

For example:

z[c(TRUE, NA, FALSE, NA, TRUE), ]
#      a.id      name b.id compare
# 1       1    Pirate    1       1
# NA                NA
# NA.1              NA
# 5       4 Spaghetti       NA

As expected, this command returns the first and fifth row, but also two NA rows corresponding to the NAs in the index vector.

Weird NA behaviour in R

Answers (2)

Related Questions