r.bot
r.bot

Reputation: 5424

Weird NA behaviour in R

I've noticed some strange behaviour when merging dataframes in R.

When merging, I get extra NA rows in the merged dataframe. The don't appear if displaying the dataframe and don't affect the output of length(), but become apparent when indexing. There is a workaround, but it isn't particularly satisfying. I'd much rather not have the strange behaviour at all.

This happens using join() in the plyr package and may also happen when joining tables in sql queries with RODBC. It may happen with other packages too.

Can anyone explain what is going on here and how to avoid it? I'm using RStudio in winXP and win8, R version 3.0.1.

> library(plyr)
> # example adapted from http://www.codinghorror.com/blog/2007/10/a-visual-explanation-of-sql-joins.html
> 
> a <- as.data.frame(cbind(seq(1, 4, 1), c("Pirate", "Monkey", "Ninja", "Spaghetti")), stringsAsFactors = FALSE)
> names(a) <- c("a.id", "name")
> a
  a.id      name
1    1    Pirate
2    2    Monkey
3    3     Ninja
4    4 Spaghetti
> 
> b <- as.data.frame(cbind(seq(1, 5, 1), c("Pirate", "Rutabaga", "Darth Vader", "Ninja", "Ninja")), stringsAsFactors = FALSE)
> names(b) <- c("b.id", "name")
> b
  b.id        name
1    1      Pirate
2    2    Rutabaga
3    3 Darth Vader
4    4       Ninja
5    5       Ninja
> 
> z <- join(a, b, by = "name", type = "left", match = "all")
> z
  a.id      name b.id
1    1    Pirate    1
2    2    Monkey <NA>
3    3     Ninja    4
4    3     Ninja    5
5    4 Spaghetti <NA>
> z$compare <- ifelse(z$a.id==z$b.id, 1, 0)
> z # shows dataframe of 5 rows
  a.id      name b.id compare
1    1    Pirate    1       1
2    2    Monkey <NA>      NA
3    3     Ninja    4       0
4    3     Ninja    5       0
5    4 Spaghetti <NA>      NA
> z[z$compare==0 ,] # shows extra rows with row names of NA and NA.1
     a.id  name b.id compare
NA   <NA>  <NA> <NA>      NA
3       3 Ninja    4       0
4       3 Ninja    5       0
NA.1 <NA>  <NA> <NA>      NA
> 
> z <- z[!is.na(z$a.id),]
> z[z$compare==0 ,] # NA rows retained
     a.id  name b.id compare
NA   <NA>  <NA> <NA>      NA
3       3 Ninja    4       0
4       3 Ninja    5       0
NA.1 <NA>  <NA> <NA>      NA

# work around to produce expected output
> z[z$compare==0 & !is.na(z$compare) ,]
  a.id  name b.id compare
3    3 Ninja    4       0
4    3 Ninja    5       0

Upvotes: 0

Views: 244

Answers (2)

TheComeOnMan
TheComeOnMan

Reputation: 12905

Your row subset condition returns an NA which is neither TRUE or FALSE. The NA rows are referring to the second and fifth row of your data.frame.

For instance -

> z$compare==0
[1] FALSE    NA  TRUE  TRUE    NA
> z[NA,]
     a.id name b.id compare
NA   <NA> <NA> <NA>      NA
NA.1 <NA> <NA> <NA>      NA
NA.2 <NA> <NA> <NA>      NA
NA.3 <NA> <NA> <NA>      NA
NA.4 <NA> <NA> <NA>      NA

Upvotes: 1

Sven Hohenstein
Sven Hohenstein

Reputation: 81753

If you want to extract those rows where compare is equal to 0, you have to exclude the NA cases.

Have a look at the column compare:

z$compare
[1]  1 NA  0  0 NA

When you use comparison, i.e., ==, NAs will result in NAs. @Codoremifa also showed this in his answer.

z$compare == 0
# [1] FALSE    NA  TRUE  TRUE    NA

You should do both (a) compare compare with 0 and exclude NAs with !is.na():

z$compare == 0 & !is.na(z$compare)
# [1] FALSE FALSE  TRUE  TRUE FALSE

This command returns TRUE and FALSE only.

This output can be used for subsetting:

z[z$compare == 0 & !is.na(z$compare), ]
#   a.id  name b.id compare
# 3    3 Ninja    4       0
# 4    3 Ninja    5       0

On the behavior of NA used for indexing rows of data frames.

All values will be replaced by NA if you use NA for indexing:

z[NA, ]
#      a.id name b.id compare
# NA   <NA> <NA> <NA>      NA
# NA.1 <NA> <NA> <NA>      NA
# NA.2 <NA> <NA> <NA>      NA
# NA.3 <NA> <NA> <NA>      NA
# NA.4 <NA> <NA> <NA>      NA

(This behavior is due to vector recycling. The command is the same as z[rep(NA, nrow(z)), ].)

If you use an index vector including NAs, the corresponding rows will also contain NAs only.

For example:

z[c(TRUE, NA, FALSE, NA, TRUE), ]
#      a.id      name b.id compare
# 1       1    Pirate    1       1
# NA   <NA>      <NA> <NA>      NA
# NA.1 <NA>      <NA> <NA>      NA
# 5       4 Spaghetti <NA>      NA

As expected, this command returns the first and fifth row, but also two NA rows corresponding to the NAs in the index vector.

Upvotes: 2

Related Questions