Reputation: 171

How to select rows based on the combination of columns

I have the following data frame:

structure(list(Species = 1:4, Ni = c(1, NA, 1, 1), Zn = c(1, 
1, 1, 1), Cu = c(NA, NA, 1, NA)), .Names = c("Species", "Ni", 
"Zn", "Cu"), row.names = c(NA, -4L), class = "data.frame")

and I would like to get a vector containing all the species where Ni = 1, Zn = 1 and Cu = NA. So in this example that would be (1,4)

I thought I could have a try with the R script select * from where, but I can't seem to install the package RMySQL on RStudio (R version 2.15.1).

Upvotes: 0

Answers (4)

Backlin

Reputation: 14852

df <- structure(list(Species=1:4,Ni=c(1,NA,1,1),Zn=c(1,1,1,1),Cu=c(NA,NA,1,NA)),
      .Names=c("Species","Ni","Zn","Cu"),row.names=c(NA,-4L),class="data.frame")

with(df, Species[Ni %in% 1 & Zn %in% 1 & Cu %in% NA])
[1] 1 4

Rather than using Ni == 1 you should use Ni %in% 1, as the former will return NA elements where Ni is NA. Cu %in% NA produces the same result as is.na(Cu).

with(df, Species[Ni == 1 & Zn %in% 1 & Cu %in% NA])
[1] 1 NA 4

Note though that Ni == 1 used in subset, as in @MadScone's answer, does not suffer from this (which came as a surprise to me).

subset(df, Ni == 1 & Zn == 1 & is.na(Cu), Species)
  Species
1       1
4       4

Upvotes: 4

Sven Hohenstein

Reputation: 81693

df <- structure(list(Species = 1:4, Ni = c(1, NA, 1, 1), Zn = c(1, 1, 1, 1),
                     Cu = c(NA, NA, 1, NA)),
                .Names = c("Species", "Ni", "Zn", "Cu"), row.names = c(NA, -4L),
                class = "data.frame")

If the columns Ni, Zn, and Cu contain 1 and NA only, you could simply use:

subset(df, Ni & Zn & is.na(Cu), Species)

Upvotes: 0

mjv

Reputation: 75205

Edit:
I stand corrected by Backlin...
It is much better to use %in% rather than == x & !is.na() !
MadScone's suggestion of using subset() is better yet, since the result remains a data.frame, even when one select only one column for the output, i.e.

> class(subset(df, Ni == 1 & Zn == 1 & is.na(Cu), Species))
[1] "data.frame"
#whereby we get a vector when only one column is selected...
> class(df[df$Ni %in% 1 & df$Zn %in% 1 & is.na(df$Cu), 1])
[1] "integer"
# but we get data.frame when using multiple columns...
> class(df[df$Ni %in% 1 & df$Zn %in% 1 & is.na(df$Cu), 1:2])
[1] "data.frame"

I'm just leaving my sub-par answer to mention this alternative idiom, as one one should avoid!

Setup:

>  df <- structure(list(Species = 1:4, Ni = c(1, NA, 1, 1), Zn = c(1, 
    1, 1, 1), Cu = c(NA, NA, 1, NA)), .Names = c("Species", "Ni", 
    "Zn", "Cu"), row.names = c(NA, -4L), class = "data.frame")
> df
  Species Ni Zn Cu
1       1  1  1 NA
2       2 NA  1 NA
3       3  1  1  1
4       4  1  1 NA

Query:

>  df[df$Ni == 1 & !is.na(df$Ni) 
      & df$Zn == 1 & !is.na(df$Zn) 
      & is.na(!df$Cu), ]

  Species Ni Zn Cu
 1       1  1  1 NA
 4       4  1  1 NA

The trick with NA values is to explicitly exclude them i.e. for example with Ni, request value 1 and !is.na() etc. Failure to do so results in finding records where, say, Ni is NA

As mentioned above the df[df$Ni %in% 1 & df$Zn %in% 1 & is.na(!df$Cu), ] idiom is much preferable and using subset() is typically better yet.

> df[df$Ni == 1 & df$Zn == 1 & is.na(!df$Cu), ]
   Species Ni Zn Cu
1        1  1  1 NA
NA      NA NA NA NA       # OOPS...
4        4  1  1 NA

Upvotes: 0

Ciarán Tobin

Reputation: 7526

Have a look at subset().

x <- structure(list(Species = 1:4, Ni = c(1, NA, 1, 1), Zn = c(1, 
     1, 1, 1), Cu = c(NA, NA, 1, NA)), .Names = c("Species", "Ni", 
     "Zn", "Cu"), row.names = c(NA, -4L), class = "data.frame")

subset(x, Ni == 1 & Zn == 1 & is.na(Cu), Species)

Upvotes: 2

How to select rows based on the combination of columns

Answers (4)

Related Questions