Sean Norton
Sean Norton

Reputation: 287

Find cumulative value change and time since last change in panel data

I have panel data (small example of data below), and want to calculate both when a variable changes as well as the time from the last change. The end goal is to get two variables: cumulative change in any given year (i.e. the difference between the current value and the starting value), and the time since the last change.

# example data
structure(list(rcode = c("DE1", "DE1", "DE1", "DE1", "DE1", "DE1", "DE1", "DE1", "DE1", 
                         "DE1", "DE1", "DE2", "DE2", "DE2", "DE2", "DE2", "DE2", "DE2", 
                         "DE2", "DE2", "DE2", "DE2"), 
               n_RAI = c(34.47871, 34.91596, 34.91596, 34.88745, 34.79049, 34.79049, 
                         34.79049, 35.48414, 35.98985, 35.98985, 36.98985, 34.47871, 
                         34.91596, 34.91596, 34.88745, 34.79049, 34.79049, 34.79049, 
                         35.48414, 35.98985, 35.98985, 36.98985), 
               year = c(2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 2007L, 2008L, 
                        2009L, 2010L, 2000L, 2001L, 2002L, 2003L, 2004L, 2005L, 2006L, 
                        2007L, 2008L, 2009L, 2010L)), 
           row.names = 375:396, class = "data.frame")

In this data, r_code is individual identifier, n_RAI is the variable I'm trying to lag, and year is the time variable.

It's fairly simple to identify when a change occurred for each individual in the panel, using the answer here.

Where I'm struggling is to come up with a function that, using dplyr::group_by or tapply to apply across individuals in the panel, can also calculate the time since the last change. Essentially, I need a function that identifies when a change occurs, starts counting upwards moving forwards, then resets when another change occurs. It seems simple, but I've hit some mental block and can't think of how to translate that into R. Is there a function out there or a simple base R way of doing this that I'm failing to see?

e.g. the expected output for individual DE1, using "change" as a variable name for identifying a change and "t_since" as the time from the last change:

change = 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0
t_since = 0, 0, 1, 2, 0, 1, 2, 3, 0, 1, 2

Upvotes: 1

Views: 299

Answers (1)

Gregor Thomas
Gregor Thomas

Reputation: 146249

I think this works. It's a little convoluted. The t_since result matches your desired result, the change column is a little different than what you show in the question, but I think that might be a typo in your question. (Calling your data dd.)

dd = dd %>%
  group_by(rcode) %>%
  mutate(
    change = c(0, abs(diff(n_RAI)) > 1e-8),
    grouper = (cumsum(change) - 1) %/% 2
  ) %>%
  group_by(rcode, grouper) %>%
  mutate(t_since = pmax(0, 0:(n() - 1))) %>%
  ungroup() %>%
  select(-grouper)

change = c(0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0)
t_since = c(0, 0, 1, 2, 0, 1, 2, 3, 0, 1, 2)

dd = cbind(dd, goal_change = change, goal_t_since = t_since)

all(dd$goal_change == dd$change)
# [1] FALSE
all(dd$goal_t_since == dd$t_since)
# [1] TRUE

Upvotes: 1

Related Questions