Use dplyr to find rows with specified conditions *and* a set of related rows

Question

Using dplyr on a data frame of population sizes over time, I'd like to identify the set of time points at which the subpopulations first exceed zero, and also the corresponding set of previous time points (i.e. the latest times before subpopulations exceed zero). I can find the first set of time points as follows:

library(dplyr)

df <- data.frame(time = rep(1:4, each = 3),
  id = rep(letters[1:3], times = 4),
  population = c(1, 0, 0, 2, 1, 0, 0, 2, 1, 0, 0, 0))

first_gens <- group_by_(df, ~id) %>%
  filter_(~population > 0) %>%
  summarise_(start_time = ~min(time)) %>%
  ungroup()

In this example, the first time points for subpopulations a, b and c are respectively 1, 2 and 3.

What I can't figure out is an easy way to find the previous time points. In this example, the previous time points for subpopulations a, b and c should be respectively NA, 1 and 2 (dealing with the NA case is unimportant as I can filter out such cases).

Edit: I want a solution that works for an arbitrary sequence of time points.

Any help would be much appreciated.

(NB: I'm using "_" forms of dplyr functions to satisfy CRAN package requirements.)

Julien Navarre · Accepted Answer

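You can use lag(): within each group, index the lagged time vector at the first position where population > 0. This assumes the rows are sorted by time within each id.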

df %>%
  group_by(id) %>%
  summarize(min(time[population > 0]), 
            lag(time)[min(which(population > 0))])

# A tibble: 3 x 3
  id    `min(time[population > 0])` `lag(time)[min(which(population > 0))]`
1 a                               1                                      NA
2 b                               2                                       1
3 c                               3                                       2
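
Since the question asks for something that works for an arbitrary sequence of time points, here is a minimal sketch of the same lag() idea with named output columns and an explicit sort. The irregular time values (2, 5, 11, 20), the reversed row order, and the column name previous_time are illustrative assumptions, not part of the original answer; start_time is the name used in the question.

```r
library(dplyr)

# Same populations as in the question, but with irregular time points
# (illustrative values) to show that nothing depends on time being 1:4.
df <- data.frame(
  time = rep(c(2L, 5L, 11L, 20L), each = 3),
  id = rep(letters[1:3], times = 4),
  population = c(1, 0, 0, 2, 1, 0, 0, 2, 1, 0, 0, 0)
)

# Reverse the rows so the input is deliberately not sorted by time.
df <- df[rev(seq_len(nrow(df))), ]

starts <- df %>%
  arrange(id, time) %>%   # lag() depends on row order, so sort first
  group_by(id) %>%
  summarise(
    # which(...)[1] is the first row with a positive population,
    # or NA if the group never exceeds zero
    start_time = time[which(population > 0)[1]],
    # lag(time) holds the previous time point for each row (NA for the first row)
    previous_time = lag(time)[which(population > 0)[1]]
  )

print(starts)
```

Using which(...)[1] rather than min(which(...)) means a subpopulation that never exceeds zero yields NA in both columns without a warning, which can then be filtered out as the question suggests.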
