rob

Reputation: 47

Use dplyr to find rows with specified conditions *and* a set of related rows

Using dplyr on a data frame of population sizes over time, I'd like to identify the set of time points at which the subpopulations first exceed zero, and also the corresponding set of previous time points (i.e. the latest times before subpopulations exceed zero). I can find the first set of time points as follows:

library(dplyr)

df <- data.frame(time = rep(1:4, each = 3),
  id = rep(letters[1:3], times = 4),
  population = c(1, 0, 0, 2, 1, 0, 0, 2, 1, 0, 0, 0))

first_gens <- group_by_(df, ~id) %>%
  filter_(~population > 0) %>%
  summarise_(start_time = ~min(time)) %>%
  ungroup()

In this example, the first time points for subpopulations a, b and c are respectively 1, 2 and 3.

What I can't figure out is an easy way to find the previous time points. In this example, the previous time points for subpopulations a, b and c should be respectively NA, 1 and 2 (dealing with the NA case is unimportant as I can filter out such cases).

Edit: I want a solution that works for an arbitrary sequence of time points.

Any help would be much appreciated.

(NB: I'm using "_" forms of dplyr functions to satisfy CRAN package requirements.)

Upvotes: 1

Views: 42

Answers (1)

Julien Navarre

Reputation: 7830

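You can use lag(). Within each group, lag(time) shifts the time column down by one row, so indexing it at the first row where population > 0 gives the time point just before the subpopulation first exceeds zero. This assumes the rows are ordered by time within each id, as they are in the example.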

df %>%
  group_by(id) %>%
  summarize(min(time[population > 0]), 
            lag(time)[min(which(population > 0))])

> df %>%
+   group_by(id) %>%
+   summarize(min(time[population > 0]), 
+             lag(time)[min(which(population > 0))])
# A tibble: 3 x 3
  id    `min(time[which(population > 0)])` `lag(time)[min(which(population > 0))]`
  <fct>                              <int>                                   <int>
1 a                                      1                                      NA
2 b                                      2                                       1
3 c                                      3                                       2
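
For the "arbitrary sequence of time points" case in the question's edit, here is a minimal sketch, assuming the time points may be irregularly spaced and the rows may arrive unsorted. The irregular times and the row shuffle below are made up for illustration, and the column names start_time and previous_time are my own choice. Sorting with arrange() before grouping means lag() always sees the times in increasing order:

```r
library(dplyr)

# Same populations as in the question, but with irregular time points
# and shuffled rows (both are assumptions made for this illustration).
df <- data.frame(time = rep(c(1, 5, 7, 20), each = 3),
                 id = rep(letters[1:3], times = 4),
                 population = c(1, 0, 0, 2, 1, 0, 0, 2, 1, 0, 0, 0))
df <- df[c(7, 2, 12, 5, 1, 9, 4, 11, 3, 8, 6, 10), ]

result <- df %>%
  arrange(id, time) %>%   # lag() relies on row order, so sort by time first
  group_by(id) %>%
  summarise(
    # which(...)[1] is the first row with a positive population
    # (NA if the group never exceeds zero, which yields NA in both columns)
    start_time = time[which(population > 0)[1]],
    previous_time = lag(time)[which(population > 0)[1]]
  ) %>%
  ungroup()

result
```

With this data, start_time should be 1, 5 and 7 for a, b and c, and previous_time should be NA, 1 and 5. Using which(...)[1] rather than min(which(...)) avoids a warning for a group that never exceeds zero.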

Upvotes: 1
