Reputation: 13
I have a dataframe that looks like this:
model | powertrain | start_year | end_year |
---|---|---|---|
car1 | diesel | 2010 | 2015 |
car1 | diesel | 2011 | 2013 |
car1 | electric | 2010 | 2015 |
car1 | diesel | 2015 | 2017 |
car2 | electric | 2012 | 2013 |
car2 | electric | 2010 | 2017 |
And I want to remove all rows where the range (i.e. start_year until end_year) lies within the range of another row with the same model and powertrain (i.e. remove rows: 2 and 5). I tried multiple approach but can't get them to work...
So far, I tried this:
test_range <- df_unique %>% rowwise() %>% mutate(years_range = list(start_year:end_year)) %>% ungroup()
df2 <- df %>% filter(!sapply(years_range, function(x) any(sapply(df$years_range, function(y) all(x %in% y) & !identical(x, y))))) %>% group_by(model, powertrain, years_range) %>% slice_head() %>% ungroup()
Upvotes: 0
Views: 74
Reputation: 72623
You can follow an outer
approach that compares sequences from start to end, obtained with `:()`
, among each other, if all
years are included in another sequence.
> g <- \(x) {
+ if (nrow(x) == 1) {
+ TRUE
+ } else {
+ sqs <- with(x, Map(`:`, start_year, end_year))
+ f <- \(i, j) all(sqs[[i]] %in% sqs[[j]])
+ o <- outer(seq_along(sqs), seq_along(sqs), Vectorize(f))
+ diag(o) <- NA
+ rowSums(o, na.rm=TRUE) != 1
+ }
+ }
Put it in by
that splits along defined by.
vars, apply function and unsplit
.
> by. <- c("model", "powertrain")
> dat[by(dat, dat[by.], FUN=g) |> unsplit(dat[by.]), ]
model powertrain start_year end_year
1 car1 diesel 2010 2015
3 car1 electric 2010 2015
4 car1 diesel 2015 2017
6 car2 electric 2010 2017
Data:
> dput(dat)
structure(list(model = c("car1", "car1", "car1", "car1", "car2",
"car2"), powertrain = c("diesel", "diesel", "electric", "diesel",
"electric", "electric"), start_year = c(2010L, 2011L, 2010L,
2015L, 2012L, 2010L), end_year = c(2015L, 2013L, 2015L, 2017L,
2013L, 2017L)), class = "data.frame", row.names = c(NA, -6L))
Upvotes: 1