maycca
maycca

Reputation: 4102

R: subset dataframe by two conditions based on one column

I have a dataframe containing locations (loc), where each location has two zones (type) and area of disturbances in each zone (area). From my dataframe, I would like to keep locations that have disturbances in both zones.

I thought I can simply do this by subsetting data by two conditions, however using AND condition (&)

subset(dd, (dd$type == "npr" & dd$area > 0 ) & (dd$type == "buff" & dd$area > 0 ) ) 

gives me null output

[1] loc  type area
<0 rows> (or 0-length row.names)

and using "OR" condition (|)

subset(dd, (dd$type == "npr" & dd$area > 0 ) | (dd$type == "buff" & dd$area > 0 ) ) 

is not what I want..

  loc type area
1   a  npr   10
2   a buff   20
4   b buff   10
5   c  npr    5
7   d  npr    5
8   d buff    5

How to correctly subset locations which have disturbance area > 0 in both zone types?


My dummy data:

loc<-c("a", "a", "b", "b", "c", "c", "d", "d")
type= rep(c("npr", "buff"), 4)
area = c(10,20,0,10,5,0,5,5)


dd<-data.frame(loc, type, area)

Desired output:

loc type area
1   a  npr   10
2   a buff   20
3   d  npr    5
4   d buff    5

Upvotes: 2

Views: 6021

Answers (2)

r.user.05apr
r.user.05apr

Reputation: 5456

I would reshape first:

loc<-c("a", "a", "b", "b", "c", "c", "d", "d")
type= rep(c("npr", "buff"), 4)
area = c(10,20,0,10,5,0,5,5)
dd<-data.frame(loc, type, area)

library(reshape2)
dd_wide <- dcast(dd, loc ~  type, value.var = "area")
mask <- subset(dd_wide, buff > 0 & npr > 0)[["loc"]]
subset(dd, loc %in% mask)

# one line:
subset(dd, loc %in% subset(dcast(dd, loc ~  type, value.var = "area"), buff > 0 & npr > 0)[["loc"]])

Upvotes: 2

Mikko Marttila
Mikko Marttila

Reputation: 11908

You'll need to compute a grouped summary to achieve this. That is, you want to find out for each loc if all of the areas in that location are > 0. I have always found base R a bit awkward for grouped statistics, but here's one way to achieve that.

First, use tapply() to determine for each loc whether it should be included or not:

(include <- tapply(dd$area, dd$loc, function(x) all(x > 0)))
#>     a     b     c     d 
#>  TRUE FALSE FALSE  TRUE

Then we can use loc values to index that result to get a vector suitable to subset dd with:

include[dd$loc]
#>     a     a     b     b     c     c     d     d 
#>  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE

dd[include[dd$loc], ]
#>   loc type area
#> 1   a  npr   10
#> 2   a buff   20
#> 7   d  npr    5
#> 8   d buff    5

We can also put these steps together inside a subset() call to avoid creating extra variables:

subset(dd, tapply(area, loc, function(x) all(x > 0))[loc])
#>   loc type area
#> 1   a  npr   10
#> 2   a buff   20
#> 7   d  npr    5
#> 8   d buff    5

Alternatively, you could use dplyr:

library(dplyr)

dd %>% 
  group_by(loc) %>% 
  filter(all(area > 0))
#> # A tibble: 4 x 3
#> # Groups:   loc [2]
#>   loc   type   area
#>   <fct> <fct> <dbl>
#> 1 a     npr      10
#> 2 a     buff     20
#> 3 d     npr       5
#> 4 d     buff      5

Created on 2018-07-25 by the reprex package (v0.2.0.9000).

Upvotes: 2

Related Questions