comcco

Reputation: 43

How to assign sequential identities to binary states with dplyr?

I am trying to analyze some information about an object as it moves between three possible states. The data is laid out such that each state has its own column, with binary values in sequential order, e.g.,

df <- data.frame(state1 = c(0,0,0,1,1,0,0,1,0,0,0), state2 = c(1,1,1,0,0,1,1,0,1,0,0), state3 = c(0,0,0,0,0,0,0,0,0,1,1))

print(df)

   state1 state2 state3
1       0      1      0
2       0      1      0
3       0      1      0
4       1      0      0
5       1      0      0
6       0      1      0
7       0      1      0
8       1      0      0
9       0      1      0
10      0      0      1
11      0      0      1

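I would like to add a fourth column indicating which state is occupied, while also recording the order in which each state is visited (so 2.1 is the first time state 2 is occupied, 2.2 the second time, and so on), e.g.,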

df2 <- data.frame(state1 = c(0,0,0,1,1,0,0,1,0,0,0), state2 = c(1,1,1,0,0,1,1,0,1,0,0), state3 = c(0,0,0,0,0,0,0,0,0,1,1), state.id = c(2.1, 2.1, 2.1, 1.1, 1.1, 2.2, 2.2, 1.2, 2.3, 3.1, 3.1))

print(df2)

   state1 state2 state3 state.id
1       0      1      0      2.1
2       0      1      0      2.1
3       0      1      0      2.1
4       1      0      0      1.1
5       1      0      0      1.1
6       0      1      0      2.2
7       0      1      0      2.2
8       1      0      0      1.2
9       0      1      0      2.3
10      0      0      1      3.1
11      0      0      1      3.1

How could I go about doing this (preferably via the dplyr package)? Thanks in advance.

Upvotes: 4

Views: 64

Answers (2)

Ronak Shah

Reputation: 389135

We can use max.col to get the index of the column that holds the maximum value in each row (temp). We also create a row-number column with row_number(), then build a sequential counter within each temp group that increments whenever the gap between consecutive row numbers is greater than 1.

library(dplyr)

df %>%
   mutate(temp = max.col(.), 
          row = row_number()) %>%
   group_by(temp) %>%
   mutate(temp1 = cumsum(row - lag(row, default = first(row)) > 1) + 1,
          state.id = paste(temp, temp1, sep = ".")) %>%
   ungroup %>%
   select(-temp, -temp1, -row)

# A tibble: 11 x 4
#   state1 state2 state3 state.id
#    <dbl>  <dbl>  <dbl> <chr>   
# 1      0      1      0 2.1     
# 2      0      1      0 2.1     
# 3      0      1      0 2.1     
# 4      1      0      0 1.1     
# 5      1      0      0 1.1     
# 6      0      1      0 2.2     
# 7      0      1      0 2.2     
# 8      1      0      0 1.2     
# 9      0      1      0 2.3     
#10      0      0      1 3.1     
#11      0      0      1 3.1     
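
To make the two helper columns less opaque, here is a small base R sketch of what they hold for state 2 only. It is an illustration rather than part of the answer's pipeline: the names rows_state2, gaps and run_state2 are made up for this example, and ties.method = "first" is an added assumption to make max.col deterministic (its default is "random", which does not matter here because each row has exactly one 1).

```r
df <- data.frame(state1 = c(0,0,0,1,1,0,0,1,0,0,0),
                 state2 = c(1,1,1,0,0,1,1,0,1,0,0),
                 state3 = c(0,0,0,0,0,0,0,0,0,1,1))

# Index of the column holding the row maximum, i.e. the active state.
temp <- max.col(as.matrix(df), ties.method = "first")

# Row numbers where state 2 is active, and the gap to the previous such row.
# The first gap is 0, mirroring lag(row, default = first(row)).
rows_state2 <- which(temp == 2)
gaps <- diff(c(rows_state2[1], rows_state2))

# A gap larger than 1 means the state was left and re-entered,
# so the counter moves on to the next visit.
run_state2 <- cumsum(gaps > 1) + 1

print(rows_state2)
print(run_state2)
```

Pasting temp and the per-state counter together with a "." is what produces values such as 2.1, 2.2 and 2.3.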

Upvotes: 3

Calum You

Reputation: 15072

This is a little complicated because you need to answer two questions and then combine the answers:

  1. Which state is active in a given row?
  2. How many times has each state now been active?
  3. (Combined) How many times has the active state been active so far?

I think the way to do this in the end is to reshape the data, and use two groupings to get what you want:

  1. Group by row and use which to figure out the state number that is active
  2. Group by state, and use cumsum to get an index that goes up every time a state becomes active.
  3. To combine, filter to keep only the row that corresponds to the active state's column for each rowid, then right_join back onto the original data.

I think the second task in particular is hard to achieve without reshaping or dealing with an unwieldy number of columns.

library(tidyverse)
df <- data.frame(state1 = c(0,0,0,1,1,0,0,1,0,0,0), state2 = c(1,1,1,0,0,1,1,0,1,0,0), state3 = c(0,0,0,0,0,0,0,0,0,1,1))
df <- rowid_to_column(df)

df %>%
  gather(state, binary, -rowid) %>%
  arrange(state, rowid) %>%
  group_by(rowid) %>%
  mutate(state_num = which(binary == 1)) %>%
  group_by(state) %>%
  mutate(
    state_col_num = str_remove(state, "state") %>% as.integer(),
    sequence = cumsum(binary == 1 & lag(binary, default = 0) == 0)
  ) %>%
  ungroup() %>%
  filter(state_num == state_col_num) %>%
  unite("state.id", state_num, sequence, sep = ".") %>%
  right_join(df, by = "rowid") %>%
  select(rowid, matches("state(\\d+|\\.id)"))
#> # A tibble: 11 x 5
#>    rowid state.id state1 state2 state3
#>    <int> <chr>     <dbl>  <dbl>  <dbl>
#>  1     1 2.1           0      1      0
#>  2     2 2.1           0      1      0
#>  3     3 2.1           0      1      0
#>  4     4 1.1           1      0      0
#>  5     5 1.1           1      0      0
#>  6     6 2.2           0      1      0
#>  7     7 2.2           0      1      0
#>  8     8 1.2           1      0      0
#>  9     9 2.3           0      1      0
#> 10    10 3.1           0      0      1
#> 11    11 3.1           0      0      1

Created on 2019-08-07 by the reprex package (v0.3.0)
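
For comparison, the same three steps can also be sketched in base R on a matrix instead of reshaped data. This is not the pipeline above, only a rough equivalent under the assumption that every row has exactly one 1; the names m, active, starts and visits are made up for this illustration.

```r
df <- data.frame(state1 = c(0,0,0,1,1,0,0,1,0,0,0),
                 state2 = c(1,1,1,0,0,1,1,0,1,0,0),
                 state3 = c(0,0,0,0,0,0,0,0,0,1,1))
m <- as.matrix(df)

# Step 1: which state is active in each row (assumes exactly one 1 per row).
active <- max.col(m, ties.method = "first")

# Step 2: per state, flag every 0 -> 1 transition by comparing each row
# with the previous one (a row of zeros stands in before the first row),
# then count the transitions cumulatively down each column.
starts <- m == 1 & rbind(0, m[-nrow(m), , drop = FALSE]) == 0
visits <- apply(starts, 2, cumsum)

# Step 3: for each row, pick the counter that belongs to the active state.
df$state.id <- paste(active,
                     visits[cbind(seq_len(nrow(m)), active)],
                     sep = ".")

print(df)
```

The matrix index cbind(row, column) plays the role of the filter(state_num == state_col_num) step: it keeps one counter per row, the one in the active state's column.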

Upvotes: 2
