Reputation: 5424
I've noticed some strange behaviour when merging dataframes in R.
When merging, I get extra NA rows in the merged dataframe. The don't appear if displaying the dataframe and don't affect the output of length()
, but become apparent when indexing. There is a workaround, but it isn't particularly satisfying. I'd much rather not have the strange behaviour at all.
This happens using join()
in the plyr package and may also happen when joining tables in sql queries with RODBC. It may happen with other packages too.
Can anyone explain what is going on here and how to avoid it? I'm using RStudio in winXP and win8, R version 3.0.1.
> library(plyr)
> # example adapted from http://www.codinghorror.com/blog/2007/10/a-visual-explanation-of-sql-joins.html
>
> a <- as.data.frame(cbind(seq(1, 4, 1), c("Pirate", "Monkey", "Ninja", "Spaghetti")), stringsAsFactors = FALSE)
> names(a) <- c("a.id", "name")
> a
a.id name
1 1 Pirate
2 2 Monkey
3 3 Ninja
4 4 Spaghetti
>
> b <- as.data.frame(cbind(seq(1, 5, 1), c("Pirate", "Rutabaga", "Darth Vader", "Ninja", "Ninja")), stringsAsFactors = FALSE)
> names(b) <- c("b.id", "name")
> b
b.id name
1 1 Pirate
2 2 Rutabaga
3 3 Darth Vader
4 4 Ninja
5 5 Ninja
>
> z <- join(a, b, by = "name", type = "left", match = "all")
> z
a.id name b.id
1 1 Pirate 1
2 2 Monkey <NA>
3 3 Ninja 4
4 3 Ninja 5
5 4 Spaghetti <NA>
> z$compare <- ifelse(z$a.id==z$b.id, 1, 0)
> z # shows dataframe of 5 rows
a.id name b.id compare
1 1 Pirate 1 1
2 2 Monkey <NA> NA
3 3 Ninja 4 0
4 3 Ninja 5 0
5 4 Spaghetti <NA> NA
> z[z$compare==0 ,] # shows extra rows with row names of NA and NA.1
a.id name b.id compare
NA <NA> <NA> <NA> NA
3 3 Ninja 4 0
4 3 Ninja 5 0
NA.1 <NA> <NA> <NA> NA
>
> z <- z[!is.na(z$a.id),]
> z[z$compare==0 ,] # NA rows retained
a.id name b.id compare
NA <NA> <NA> <NA> NA
3 3 Ninja 4 0
4 3 Ninja 5 0
NA.1 <NA> <NA> <NA> NA
# work around to produce expected output
> z[z$compare==0 & !is.na(z$compare) ,]
a.id name b.id compare
3 3 Ninja 4 0
4 3 Ninja 5 0
Upvotes: 0
Views: 244
Reputation: 12905
Your row subset condition returns an NA which is neither TRUE or FALSE. The NA rows are referring to the second and fifth row of your data.frame.
For instance -
> z$compare==0
[1] FALSE NA TRUE TRUE NA
> z[NA,]
a.id name b.id compare
NA <NA> <NA> <NA> NA
NA.1 <NA> <NA> <NA> NA
NA.2 <NA> <NA> <NA> NA
NA.3 <NA> <NA> <NA> NA
NA.4 <NA> <NA> <NA> NA
Upvotes: 1
Reputation: 81753
If you want to extract those rows where compare
is equal to 0
, you have to exclude the NA
cases.
Have a look at the column compare
:
z$compare
[1] 1 NA 0 0 NA
When you use comparison, i.e., ==
, NA
s will result in NA
s. @Codoremifa also showed this in his answer.
z$compare == 0
# [1] FALSE NA TRUE TRUE NA
You should do both (a) compare compare
with 0
and exclude NA
s with !is.na()
:
z$compare == 0 & !is.na(z$compare)
# [1] FALSE FALSE TRUE TRUE FALSE
This command returns TRUE
and FALSE
only.
This output can be used for subsetting:
z[z$compare == 0 & !is.na(z$compare), ]
# a.id name b.id compare
# 3 3 Ninja 4 0
# 4 3 Ninja 5 0
On the behavior of NA
used for indexing rows of data frames.
All values will be replaced by NA
if you use NA
for indexing:
z[NA, ]
# a.id name b.id compare
# NA <NA> <NA> <NA> NA
# NA.1 <NA> <NA> <NA> NA
# NA.2 <NA> <NA> <NA> NA
# NA.3 <NA> <NA> <NA> NA
# NA.4 <NA> <NA> <NA> NA
(This behavior is due to vector recycling. The command is the same as z[rep(NA, nrow(z)), ]
.)
If you use an index vector including NA
s, the corresponding rows will also contain NA
s only.
For example:
z[c(TRUE, NA, FALSE, NA, TRUE), ]
# a.id name b.id compare
# 1 1 Pirate 1 1
# NA <NA> <NA> <NA> NA
# NA.1 <NA> <NA> <NA> NA
# 5 4 Spaghetti <NA> NA
As expected, this command returns the first and fifth row, but also two NA
rows corresponding to the NA
s in the index vector.
Upvotes: 2