Reputation: 984

How to filter a grouped dataframe with a conditional statement using dplyr?

I want to filter a dataframe using dplyr using a conditional. The condition I want to test is whether a country-year combination has two versions.

df <- data.frame(country = c("country1", "country2", "country1", "country2", "country3"), year = rep(2011,5), version = c("versionA", "versionA", "versionB", "versionB", "versionB"))

Here is what I tried after looking here:

df %>%
     group_by(country, year) %>%
     {if unique(version)==1 . else filter(version == "versionA")}

What I am hoping to get is a dataframe that looks like this:

country     year     version

country1    2011     versionA
country2    2011     versionA
country3    2011     versionB

Upvotes: 2

Answers (3)

hello_friend

Reputation: 5798

Base R one-liner thanks (@akrun):

df[!(duplicated(df[1:2])),]

Base R one-liner:

df[!(duplicated(df$country, df$year)),]

Tidyverse solution:

library(tidyverse)
df %>%
  filter(!(duplicated(country, year)))

A more generic base R solution:

# Create a counter of versions for each year and country: 

df$tmp <- with(lapply(df, function(x){if(is.factor(x)){as.character(x)}else{x}}),
               ave(version, paste0(country, year), FUN = seq.int))

# Subset the dataframe to hold only the first record for each year/country: 

df[which(df$tmp == 1), ]

A more generic tidyverse solution:

df %>%
  arrange(version) %>% 
  filter(!(duplicated(country, year)))

Upvotes: 1

akrun

Reputation: 887901

After grouping by 'country', 'year', filter if number of distinct elements is greater than 1 return the 'versionA', or else return the first element

library(dplyr)
df %>%
  group_by(country, year)  %>% 
  filter((n_distinct(version)  > 1 & version == 'versionA')|row_number() == 1)
# A tibble: 3 x 3
# Groups:   country, year [3]
#  country   year version 
#  <fct>    <dbl> <fct>   
#1 country1  2011 versionA
#2 country2  2011 versionA
#3 country3  2011 versionB

Or this can be added in a if/else

df %>%
    group_by(country, year) %>%
    filter(if(n_distinct(version) > 1) version == 'versionA'
       else row_number() ==1)
# A tibble: 3 x 3
# Groups:   country, year [3]
#  country   year version 
#  <fct>    <dbl> <fct>   
#1 country1  2011 versionA
#2 country2  2011 versionA
#3 country3  2011 versionB

Or another option is arrange

df %>% 
    arrange(country, year, version != 'versionA') %>% 
    group_by(country, year) %>% 
    slice(1)

Or with summarize

df %>%
    group_by(country, year) %>%
    summarise(version = if(n_distinct(version) > 1) 'versionA' else first(version))

Or using data.table

library(data.table)
setDT(df)[, .SD[if(n_distinct(version) > 1) version == 'versionA' 
          else 1], .(country, year)]

Upvotes: 1

Ronak Shah

Reputation: 389275

To count the number of unique values we can use n_distinct and filter the rows based on that.

library(dplyr)

df %>%
  group_by(country, year) %>%
  filter(if(n_distinct(version) == 2) version == 'versionA' else TRUE)


#  country   year version 
#  <fct>    <dbl> <fct>   
#1 country1  2011 versionA
#2 country2  2011 versionA
#3 country3  2011 versionB

Upvotes: 2

How to filter a grouped dataframe with a conditional statement using dplyr?

Answers (3)

Related Questions