user18937485
user18937485

Reputation:

How to subset identical rows from dataframe by condition in R?

From this dataframe I want to subset rows which have the identical "model", "Country" and "year" values but different "factor"

model <- c("A","B","C","A","A","C","B","A")
country <- c("Italy","Germany","Norway","Italy","France","Germany","Norway","Italy")
year <- c(2016,2016,2016,2016,2015,2015,2015,2015)
value <-  c(14,24,11,34,73,11,33,22)
factor <- c("NEW","OLD","OLD","OLD","OLD","OLD","NEW","NEW")

df <- data.frame(model,country,year,value, factor)

  model country year value factor
1     A   Italy 2016    14    NEW
2     B Germany 2016    24    OLD
3     C  Norway 2016    11    OLD
4     A   Italy 2016    34    OLD
5     A  France 2015    73    OLD
6     C Germany 2015    11    OLD
7     B  Norway 2015    33    NEW
8     A   Italy 2015    22    NEW

Expected output:

1     A   Italy 2016    22    NEW
4     A   Italy 2016    34    OLD

Upvotes: 4

Views: 185

Answers (3)

AndrewGB
AndrewGB

Reputation: 16836

You can group by model, country, and year, then use filter to keep rows that have duplicated values for model, country, and year. Then, to ensure that there are different factor values, we use n_distinct to make sure there are at least 2 distinct values in factor (i.e., the group must have NEW and OLD).

library(dplyr)
    
df %>%
  group_by(model, country, year) %>%
  filter(n() > 1 & n_distinct(factor) > 1)

Output

  model country  year value factor
  <chr> <chr>   <dbl> <dbl> <chr> 
1 A     Italy    2016    14 NEW   
2 A     Italy    2016    34 OLD 

Additional Example: Multiple Factors in Each Group

If you only want to keep 1 row of different factors (i.e., NEW and OLD) in each group, then we could add factor as an additional grouping variable and use slice to only keep the first row of each group.

df2 <- structure(list(model = c("A", "B", "C", "A", "A", "C", "B", "A"
), country = c("Italy", "Germany", "Norway", "Italy", "France", 
"Germany", "Norway", "Italy"), year = c(2016, 2016, 2016, 2016, 
2015, 2015, 2015, 2016), value = c(14, 34, 11, 34, 33, 11, 33, 
22), factor = c("NEW", "OLD", "OLD", "OLD", "OLD", "OLD", "NEW", 
"NEW")), class = "data.frame", row.names = c(NA, -8L))

#  model country year value factor
#1     A   Italy 2016    14    NEW
#2     B Germany 2016    34    OLD
#3     C  Norway 2016    11    OLD
#4     A   Italy 2016    34    OLD
#5     A  France 2015    33    OLD
#6     C Germany 2015    11    OLD
#7     B  Norway 2015    33    NEW
#8     A   Italy 2016    22    NEW

df2 %>%
  group_by(model, country, year) %>%
  filter(n() > 1 & n_distinct(factor) > 1) %>%
  group_by(factor, .add = T) %>%
  slice(1) 

Output

  model country  year value factor
  <chr> <chr>   <dbl> <dbl> <chr> 
1 A     Italy    2016    14 NEW   
2 A     Italy    2016    34 OLD   

Upvotes: 2

ThomasIsCoding
ThomasIsCoding

Reputation: 101034

In base R, you can use subset + ave

> subset(df, ave(seq_along(year), model, country, year, FUN = length) > 1)
  model country year value factor
1     A   Italy 2016    14    NEW
4     A   Italy 2016    34    OLD

Or, try Filter + split

> Filter(function(x) nrow(x) > 1, split(df, ~ model + country + year))
$A.Italy.2016
  model country year value factor
1     A   Italy 2016    14    NEW
4     A   Italy 2016    34    OLD

or

> do.call(rbind, unname(Filter(function(x) nrow(x) > 1, split(df, ~ model + country + year))))
  model country year value factor
1     A   Italy 2016    14    NEW
4     A   Italy 2016    34    OLD

Upvotes: 2

Quinten
Quinten

Reputation: 41225

What you could do is create a column which tells the number of duplicated based on the groups (model, country and year) and then filter like this:

model <- c("A","B","C","A","A","C","B","A")
country <- c("Italy","Germany","Norway","Italy","France","Germany","Norway","Italy")
year <- c(2016,2016,2016,2016,2015,2015,2015,2015)
value <-  c(14,24,11,34,73,11,33,22)
factor <- c("NEW","OLD","OLD","OLD","OLD","OLD","NEW","NEW")

df <- data.frame(model,country,year,value, factor)

library(dplyr)

df %>% 
  group_by(model, country, year) %>% 
  mutate(number_dups = n()) %>% 
  filter(number_dups > 1) %>%
  select(-number_dups) %>%
  ungroup()
#> # A tibble: 2 × 5
#>   model country  year value factor
#>   <chr> <chr>   <dbl> <dbl> <chr> 
#> 1 A     Italy    2016    14 NEW   
#> 2 A     Italy    2016    34 OLD

Created on 2022-08-12 by the reprex package (v2.0.1)

Please note: I assume that the expected output should be 14 instead of 22.

Upvotes: 3

Related Questions