Reputation:

How to subset identical rows from dataframe by condition in R?

From this dataframe I want to subset rows which have the identical "model", "Country" and "year" values but different "factor"

model <- c("A","B","C","A","A","C","B","A")
country <- c("Italy","Germany","Norway","Italy","France","Germany","Norway","Italy")
year <- c(2016,2016,2016,2016,2015,2015,2015,2015)
value <-  c(14,24,11,34,73,11,33,22)
factor <- c("NEW","OLD","OLD","OLD","OLD","OLD","NEW","NEW")

df <- data.frame(model,country,year,value, factor)

  model country year value factor
1     A   Italy 2016    14    NEW
2     B Germany 2016    24    OLD
3     C  Norway 2016    11    OLD
4     A   Italy 2016    34    OLD
5     A  France 2015    73    OLD
6     C Germany 2015    11    OLD
7     B  Norway 2015    33    NEW
8     A   Italy 2015    22    NEW

Expected output:

1     A   Italy 2016    22    NEW
4     A   Italy 2016    34    OLD

Upvotes: 4

Answers (3)

AndrewGB

Reputation: 16836

You can group by model, country, and year, then use filter to keep rows that have duplicated values for model, country, and year. Then, to ensure that there are different factor values, we use n_distinct to make sure there are at least 2 distinct values in factor (i.e., the group must have NEW and OLD).

library(dplyr)
    
df %>%
  group_by(model, country, year) %>%
  filter(n() > 1 & n_distinct(factor) > 1)

Output

  model country  year value factor
  <chr> <chr>   <dbl> <dbl> <chr> 
1 A     Italy    2016    14 NEW   
2 A     Italy    2016    34 OLD

Additional Example: Multiple Factors in Each Group

If you only want to keep 1 row of different factors (i.e., NEW and OLD) in each group, then we could add factor as an additional grouping variable and use slice to only keep the first row of each group.

df2 <- structure(list(model = c("A", "B", "C", "A", "A", "C", "B", "A"
), country = c("Italy", "Germany", "Norway", "Italy", "France", 
"Germany", "Norway", "Italy"), year = c(2016, 2016, 2016, 2016, 
2015, 2015, 2015, 2016), value = c(14, 34, 11, 34, 33, 11, 33, 
22), factor = c("NEW", "OLD", "OLD", "OLD", "OLD", "OLD", "NEW", 
"NEW")), class = "data.frame", row.names = c(NA, -8L))

#  model country year value factor
#1     A   Italy 2016    14    NEW
#2     B Germany 2016    34    OLD
#3     C  Norway 2016    11    OLD
#4     A   Italy 2016    34    OLD
#5     A  France 2015    33    OLD
#6     C Germany 2015    11    OLD
#7     B  Norway 2015    33    NEW
#8     A   Italy 2016    22    NEW

df2 %>%
  group_by(model, country, year) %>%
  filter(n() > 1 & n_distinct(factor) > 1) %>%
  group_by(factor, .add = T) %>%
  slice(1)

Output

  model country  year value factor
  <chr> <chr>   <dbl> <dbl> <chr> 
1 A     Italy    2016    14 NEW   
2 A     Italy    2016    34 OLD

Upvotes: 2

ThomasIsCoding

Reputation: 101034

In base R, you can use subset + ave

> subset(df, ave(seq_along(year), model, country, year, FUN = length) > 1)
  model country year value factor
1     A   Italy 2016    14    NEW
4     A   Italy 2016    34    OLD

Or, try Filter + split

> Filter(function(x) nrow(x) > 1, split(df, ~ model + country + year))
$A.Italy.2016
  model country year value factor
1     A   Italy 2016    14    NEW
4     A   Italy 2016    34    OLD

> do.call(rbind, unname(Filter(function(x) nrow(x) > 1, split(df, ~ model + country + year))))
  model country year value factor
1     A   Italy 2016    14    NEW
4     A   Italy 2016    34    OLD

Upvotes: 2

Quinten

Reputation: 41225

What you could do is create a column which tells the number of duplicated based on the groups (model, country and year) and then filter like this:

model <- c("A","B","C","A","A","C","B","A")
country <- c("Italy","Germany","Norway","Italy","France","Germany","Norway","Italy")
year <- c(2016,2016,2016,2016,2015,2015,2015,2015)
value <-  c(14,24,11,34,73,11,33,22)
factor <- c("NEW","OLD","OLD","OLD","OLD","OLD","NEW","NEW")

df <- data.frame(model,country,year,value, factor)

library(dplyr)

df %>% 
  group_by(model, country, year) %>% 
  mutate(number_dups = n()) %>% 
  filter(number_dups > 1) %>%
  select(-number_dups) %>%
  ungroup()
#> # A tibble: 2 × 5
#>   model country  year value factor
#>   <chr> <chr>   <dbl> <dbl> <chr> 
#> 1 A     Italy    2016    14 NEW   
#> 2 A     Italy    2016    34 OLD

^{Created on 2022-08-12 by the reprex package (v2.0.1)}

Please note: I assume that the expected output should be 14 instead of 22.

Upvotes: 3

How to subset identical rows from dataframe by condition in R?

Answers (3)

Related Questions