hy9fesh
hy9fesh

Reputation: 661

How to identify changes name in dplyr() groups?

I am trying to figure out how to identify name changes within a group.

For example, I have a dataframe that looks like this:

    df <- data.frame(
  state = rep(c("CA", "WI", "NY"), each = 3),
  year = rep(c(2000, 2001), each = 9),
  name = c("John", "Paul", "Sally",
           "Mary", "Fred", "Jane",
           "Linda", "Carl", "Jim",
           "Peter", "Paul", "Sally",
           "Mary", "Kate", "Jane",
           "Linda", "Carl", "Jim")
)

    > df
   state year  name
1     CA 2000  John
2     CA 2000  Paul
3     CA 2000 Sally
4     WI 2000  Mary
5     WI 2000  Fred
6     WI 2000  Jane
7     NY 2000 Linda
8     NY 2000  Carl
9     NY 2000   Jim
10    CA 2001 Peter
11    CA 2001  Paul
12    CA 2001 Sally
13    WI 2001  Mary
14    WI 2001  Kate
15    WI 2001  Jane
16    NY 2001 Linda
17    NY 2001  Carl
18    NY 2001   Jim

As you can see, "Peter" replaced "John" in 2001, and "Kate" replaced "Fred" in 2001.

So I want the output to look like:

df <- data.frame(
  state = rep(c("CA", "WI", "NY"), each = 3),
  year = rep(c(2000, 2001), each = 9),
  name = c("John", "Paul", "Sally",
           "Mary", "Fred", "Jane",
           "Linda", "Carl", "Jim",
           "Peter", "Paul", "Sally",
           "Mary", "Kate", "Jane",
           "Linda", "Carl", "Jim"),
  change = c(NA, NA, NA, NA, NA, NA, NA, NA, NA,
             1, 0, 0, 0, 1, 0, 0, 0, 0)
)

   > df
   state year  name change
1     CA 2000  John     NA
2     CA 2000  Paul     NA
3     CA 2000 Sally     NA
4     WI 2000  Mary     NA
5     WI 2000  Fred     NA
6     WI 2000  Jane     NA
7     NY 2000 Linda     NA
8     NY 2000  Carl     NA
9     NY 2000   Jim     NA
10    CA 2001 Peter      1
11    CA 2001  Paul      0
12    CA 2001 Sally      0
13    WI 2001  Mary      0
14    WI 2001  Kate      1
15    WI 2001  Jane      0
16    NY 2001 Linda      0
17    NY 2001  Carl      0
18    NY 2001   Jim      0

As you can see, Peter in 2001 and Kate in 2001 are both marked as "1" in the "change" column because they replaced "John" and "Fred" in 2000-CA and 2000-NY, respectively.

I've been looking at using some lag methods, but it seems to just look at the previous row, not by state, year groups:

df2 <- df %>% 
  group_by(state, year) %>%
  mutate(change = lag(name, order_by = year))

Any help would be appreciated!

Upvotes: 1

Views: 59

Answers (2)

akrun
akrun

Reputation: 887531

Based on the expected output, maybe this helps - create a logical column based on the duplicated 'name' in the entire data, then grouped by 'year', if all values are FALSE (!change), then replace with NA or else convert the logical to binary (+)

library(dplyr)
df %>%
    mutate(rn = row_number()) %>%
    arrange(state, year, name) %>%
    group_by(state) %>%
    mutate(change = !duplicated(name)) %>% 
    group_by(year, .add = TRUE) %>%
    mutate(
      change = if(all(change)) NA_integer_ else +(change)) %>% 
   ungroup %>% 
   arrange(rn) %>% 
   select(-rn)

-output

# A tibble: 18 × 4
   state  year name  change
   <chr> <dbl> <chr>  <int>
 1 CA     2000 John      NA
 2 CA     2000 Paul      NA
 3 CA     2000 Sally     NA
 4 WI     2000 Mary      NA
 5 WI     2000 Fred      NA
 6 WI     2000 Jane      NA
 7 NY     2000 Linda     NA
 8 NY     2000 Carl      NA
 9 NY     2000 Jim       NA
10 CA     2001 Peter      1
11 CA     2001 Paul       0
12 CA     2001 Sally      0
13 WI     2001 Mary       0
14 WI     2001 Kate       1
15 WI     2001 Jane       0
16 NY     2001 Linda      0
17 NY     2001 Carl       0
18 NY     2001 Jim        0

Using another data showed in the comments

df2 <- structure(list(state = c("AK", "AK", "AK", "AK", "AK", "AK",  
"AK", "AK", "AK", "AK", "AK", "AK", "AK", "AK", "AK", "AK", "AK", 
 "AK", "AK", "AK", "AK", "AK", "AK", "AK", "AK"), year = c(1997L,  
1998L, 1995L, 1996L, 1997L, 1995L, 1996L, 1998L, 1997L, 1998L, 
 1996L, 1995L, 1996L, 1997L, 1998L, 1995L, 1996L, 1997L, 1998L,  
1995L, 1996L, 1995L, 1996L, 1997L, 1998L), name = c("A", "A",  "A", 
"A", "B", "B", "B", "B", "C", "C", "C", "C", "D", "D", "D",  "E", 
"E", "E", "E", "F", "F", "G", "G", "G", "G")), class = "data.frame", 
row.names = c(NA,  -25L))
df2 %>%
    mutate(rn = row_number()) %>%
    arrange(state, year, name) %>%
    group_by(state) %>%
    mutate(change = !duplicated(name)) %>% 
    group_by(year, .add = TRUE) %>%
    mutate(
      change = if(all(change)) NA_integer_ else +(change)) %>% 
   ungroup %>% 
   arrange(rn) %>% 
   select(-rn) %>%
   as.data.frame

-output

 state year name change
1     AK 1997    A      0
2     AK 1998    A      0
3     AK 1995    A     NA
4     AK 1996    A      0
5     AK 1997    B      0
6     AK 1995    B     NA
7     AK 1996    B      0
8     AK 1998    B      0
9     AK 1997    C      0
10    AK 1998    C      0
11    AK 1996    C      0
12    AK 1995    C     NA
13    AK 1996    D      1
14    AK 1997    D      0
15    AK 1998    D      0
16    AK 1995    E     NA
17    AK 1996    E      0
18    AK 1997    E      0
19    AK 1998    E      0
20    AK 1995    F     NA
21    AK 1996    F      0
22    AK 1995    G     NA
23    AK 1996    G      0
24    AK 1997    G      0
25    AK 1998    G      0

Upvotes: 2

Andre Wildberg
Andre Wildberg

Reputation: 19163

A base R approach that leaves out NAs

df2 <- split(df, df$year)

cbind(df, change=rep((!(df2$"2000"$name == df2$"2001"$name))*1, length(df2)))
   state year  name change
1     CA 2000  John      1
2     CA 2000  Paul      0
3     WI 2000 Sally      0
4     WI 2000  Mary      0
5     NY 2000  Fred      1
6     NY 2000  Jane      0
7     CA 2000 Linda      0
8     CA 2000  Carl      0
9     WI 2000   Jim      0
10    WI 2001 Peter      1
11    NY 2001  Paul      0
12    NY 2001 Sally      0
13    CA 2001  Mary      0
14    CA 2001  Kate      1
15    WI 2001  Jane      0
16    WI 2001 Linda      0
17    NY 2001  Carl      0
18    NY 2001   Jim      0

Upvotes: 2

Related Questions