Reputation: 251

Drop rows conditional on value on other rows using dplyr in R

Using the example data provided below: For each institution type ("a" and "b") I want to drop rows with fac == "no" if there exists a row with fac == "yes" for the same year. I then want to sum the values by year. I am, however, not able to figure out how to drop the correct "no"-rows. Below are a couple of my attempts based on answers give here.

set.seed(123)
ext <- tibble(
  institution = c(rep("a", 7), rep("b", 7)),
  year = rep(c("2005", "2005", "2006", "2007", "2008", "2009", "2009"), 2),
  fac = rep(c("yes", "no", "no", "no", "no", "yes", "no"), 2),
  value = sample(1:100, 14, replace=T)
)

ext %>%
  group_by(institution, year) %>%
  filter(if (fac == "yes") fac != "no")

ext %>%
  group_by(institution, year) %>%
  case_when(fac == "yes" ~ filter(., fac != "no"))

ext %>%
  group_by(institution, year) %>%
  {if (fac == "yes") filter(., fac != "no")}

Upvotes: 4

Answers (4)

akrun

Reputation: 887991

An option with data.table

library(data.table)
setDT(ext)[ext[, .I[fac == 'yes'|all(fac == 'no')], .(institution, year)]$V1]

Upvotes: 1

AlexB

Reputation: 3269

Another way would be:

library(dplyr)
ext %>%
  group_by(institution, year) %>%
  filter(fac == 'yes' | n() < 2)

# institution year  fac   value
# 1 a           2005  yes      31
# 2 a           2006  no       51
# 3 a           2007  no       14
# 4 a           2008  no       67
# 5 a           2009  yes      42
# 6 b           2005  yes      43
# 7 b           2006  no       25
# 8 b           2007  no       90
# 9 b           2008  no       91
# 10 b          2009  yes      69

In case you want the overall amounts by year, add these two lines, which will yield the following output:

group_by(year) %>%
summarise(value=sum(value))

# year  value
# <chr> <int>
# 1 2005     74
# 2 2006     76
# 3 2007    104
# 4 2008    158
# 5 2009    111

Upvotes: 3

Duck

Reputation: 39623

Try creating a flag to identify the yes occurence and after that filter only the desired values. You would need to group by institution and year. Then, compute the length of values with yes greater or equal to one. With that you can flag the no values if there is some value yes inside the group. Finally, filter only the zero values in Flag and you will drop the rows as you expected. Here the code:

library(dplyr)
#Code
newdf <- ext %>% group_by(institution,year) %>%
  mutate(NYes=length(fac[fac=='yes']),
         Flag=ifelse(fac=='no' & NYes>=1,1,0)) %>%
  filter(Flag==0) %>% select(-c(NYes,Flag))

Output:

# A tibble: 10 x 4
# Groups:   institution, year [10]
   institution year  fac   value
   <chr>       <chr> <chr> <int>
 1 a           2005  yes      31
 2 a           2006  no       51
 3 a           2007  no       14
 4 a           2008  no       67
 5 a           2009  yes      42
 6 b           2005  yes      43
 7 b           2006  no       25
 8 b           2007  no       90
 9 b           2008  no       91
10 b           2009  yes      69

And the full code to summarise by year:

#Code 2
newdf <- ext %>% group_by(institution,year) %>%
  mutate(NYes=length(fac[fac=='yes']),
         Flag=ifelse(fac=='no' & NYes>=1,1,0)) %>%
  filter(Flag==0) %>% select(-c(NYes,Flag)) %>%
  ungroup() %>%
  group_by(year) %>%
  summarise(value=sum(value))

Output:

# A tibble: 5 x 2
  year  value
  <chr> <int>
1 2005     74
2 2006     76
3 2007    104
4 2008    158
5 2009    111

Upvotes: 1

Karthik S

Reputation: 11548

Does this work: by summarise, I assumed you want to sum by year after applying the filtering.

library(dplyr)
ext %>% group_by(institution, year) %>% filter(fac == 'yes'|all(fac == 'no'))
# A tibble: 10 x 4
# Groups:   institution, year [10]
   institution year  fac   value
   <chr>       <chr> <chr> <int>
 1 a           2005  yes      31
 2 a           2006  no       51
 3 a           2007  no       14
 4 a           2008  no       67
 5 a           2009  yes      42
 6 b           2005  yes      43
 7 b           2006  no       25
 8 b           2007  no       90
 9 b           2008  no       91
10 b           2009  yes      69
ext %>% group_by(institution, year) %>% filter(fac == 'yes'|all(fac == 'no')) %>% 
ungroup() %>% group_by(year) %>% summarise(value = sum(value))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 5 x 2
  year  value
  <chr> <int>
1 2005     74
2 2006     76
3 2007    104
4 2008    158
5 2009    111

Upvotes: 2

Drop rows conditional on value on other rows using dplyr in R

Answers (4)

Related Questions