Filter out rows that lie within the timeframe of other rows

Question

I have a dataframe that looks like this:

model	powertrain	start_year	end_year
car1	diesel	2010	2015
car1	diesel	2011	2013
car1	electric	2010	2015
car1	diesel	2015	2017
car2	electric	2012	2013
car2	electric	2010	2017

And I want to remove all rows where the range (i.e. start_year until end_year) lies within the range of another row with the same model and powertrain (i.e. remove rows: 2 and 5). I tried multiple approach but can't get them to work...

So far, I tried this:

Create a range variable: test_range <- df_unique %>% rowwise() %>% mutate(years_range = list(start_year:end_year)) %>% ungroup()
Remove the rows that are within the range of other variables: df2 <- df %>% filter(!sapply(years_range, function(x) any(sapply(df$years_range, function(y) all(x %in% y) & !identical(x, y))))) %>% group_by(model, powertrain, years_range) %>% slice_head() %>% ungroup()

jay.sf · Accepted Answer

You can follow an outer approach that compares sequences from start to end, obtained with `:()`, among each other, if all years are included in another sequence.

> g <- \(x) {
+   if (nrow(x) == 1) {
+     TRUE
+   } else {
+     sqs <- with(x, Map(`:`, start_year, end_year))
+     f <- \(i, j) all(sqs[[i]] %in% sqs[[j]])
+     o <- outer(seq_along(sqs), seq_along(sqs), Vectorize(f))
+     diag(o) <- NA
+     rowSums(o, na.rm=TRUE) != 1
+   }
+ }

Put it in by that splits along defined by. vars, apply function and unsplit.

> by. <- c("model", "powertrain")
> dat[by(dat, dat[by.], FUN=g) |> unsplit(dat[by.]), ]
  model powertrain start_year end_year
1  car1     diesel       2010     2015
3  car1   electric       2010     2015
4  car1     diesel       2015     2017
6  car2   electric       2010     2017

Data:

> dput(dat)
structure(list(model = c("car1", "car1", "car1", "car1", "car2", 
"car2"), powertrain = c("diesel", "diesel", "electric", "diesel", 
"electric", "electric"), start_year = c(2010L, 2011L, 2010L, 
2015L, 2012L, 2010L), end_year = c(2015L, 2013L, 2015L, 2017L, 
2013L, 2017L)), class = "data.frame", row.names = c(NA, -6L))

Filter out rows that lie within the timeframe of other rows

Answers (1)

Related Questions