Reputation:
From this dataframe I want to subset rows which have the identical "model", "Country" and "year" values but different "factor"
model <- c("A","B","C","A","A","C","B","A")
country <- c("Italy","Germany","Norway","Italy","France","Germany","Norway","Italy")
year <- c(2016,2016,2016,2016,2015,2015,2015,2015)
value <- c(14,24,11,34,73,11,33,22)
factor <- c("NEW","OLD","OLD","OLD","OLD","OLD","NEW","NEW")
df <- data.frame(model,country,year,value, factor)
model country year value factor
1 A Italy 2016 14 NEW
2 B Germany 2016 24 OLD
3 C Norway 2016 11 OLD
4 A Italy 2016 34 OLD
5 A France 2015 73 OLD
6 C Germany 2015 11 OLD
7 B Norway 2015 33 NEW
8 A Italy 2015 22 NEW
Expected output:
1 A Italy 2016 22 NEW
4 A Italy 2016 34 OLD
Upvotes: 4
Views: 185
Reputation: 16836
You can group by model
, country
, and year
, then use filter
to keep rows that have duplicated values for model
, country
, and year
. Then, to ensure that there are different factor values, we use n_distinct
to make sure there are at least 2 distinct values in factor
(i.e., the group must have NEW
and OLD
).
library(dplyr)
df %>%
group_by(model, country, year) %>%
filter(n() > 1 & n_distinct(factor) > 1)
Output
model country year value factor
<chr> <chr> <dbl> <dbl> <chr>
1 A Italy 2016 14 NEW
2 A Italy 2016 34 OLD
Additional Example: Multiple Factors in Each Group
If you only want to keep 1 row of different factors (i.e., NEW
and OLD
) in each group, then we could add factor
as an additional grouping variable and use slice
to only keep the first row of each group.
df2 <- structure(list(model = c("A", "B", "C", "A", "A", "C", "B", "A"
), country = c("Italy", "Germany", "Norway", "Italy", "France",
"Germany", "Norway", "Italy"), year = c(2016, 2016, 2016, 2016,
2015, 2015, 2015, 2016), value = c(14, 34, 11, 34, 33, 11, 33,
22), factor = c("NEW", "OLD", "OLD", "OLD", "OLD", "OLD", "NEW",
"NEW")), class = "data.frame", row.names = c(NA, -8L))
# model country year value factor
#1 A Italy 2016 14 NEW
#2 B Germany 2016 34 OLD
#3 C Norway 2016 11 OLD
#4 A Italy 2016 34 OLD
#5 A France 2015 33 OLD
#6 C Germany 2015 11 OLD
#7 B Norway 2015 33 NEW
#8 A Italy 2016 22 NEW
df2 %>%
group_by(model, country, year) %>%
filter(n() > 1 & n_distinct(factor) > 1) %>%
group_by(factor, .add = T) %>%
slice(1)
Output
model country year value factor
<chr> <chr> <dbl> <dbl> <chr>
1 A Italy 2016 14 NEW
2 A Italy 2016 34 OLD
Upvotes: 2
Reputation: 101034
In base R, you can use subset
+ ave
> subset(df, ave(seq_along(year), model, country, year, FUN = length) > 1)
model country year value factor
1 A Italy 2016 14 NEW
4 A Italy 2016 34 OLD
Or, try Filter
+ split
> Filter(function(x) nrow(x) > 1, split(df, ~ model + country + year))
$A.Italy.2016
model country year value factor
1 A Italy 2016 14 NEW
4 A Italy 2016 34 OLD
or
> do.call(rbind, unname(Filter(function(x) nrow(x) > 1, split(df, ~ model + country + year))))
model country year value factor
1 A Italy 2016 14 NEW
4 A Italy 2016 34 OLD
Upvotes: 2
Reputation: 41225
What you could do is create a column which tells the number of duplicated based on the groups (model, country and year) and then filter
like this:
model <- c("A","B","C","A","A","C","B","A")
country <- c("Italy","Germany","Norway","Italy","France","Germany","Norway","Italy")
year <- c(2016,2016,2016,2016,2015,2015,2015,2015)
value <- c(14,24,11,34,73,11,33,22)
factor <- c("NEW","OLD","OLD","OLD","OLD","OLD","NEW","NEW")
df <- data.frame(model,country,year,value, factor)
library(dplyr)
df %>%
group_by(model, country, year) %>%
mutate(number_dups = n()) %>%
filter(number_dups > 1) %>%
select(-number_dups) %>%
ungroup()
#> # A tibble: 2 × 5
#> model country year value factor
#> <chr> <chr> <dbl> <dbl> <chr>
#> 1 A Italy 2016 14 NEW
#> 2 A Italy 2016 34 OLD
Created on 2022-08-12 by the reprex package (v2.0.1)
Please note: I assume that the expected output should be 14 instead of 22.
Upvotes: 3