FilipeTeixeira
FilipeTeixeira

Reputation: 1160

Regroup dataset

With the dataset below:

  id from to trip
1  1    A  B <NA>
2  1    B  C    X
3  1    C  D <NA>
4  1    D  A    X
5  2    B  A    X
6  2    A  C <NA>
7  2    C  D <NA>

I'm trying to split the data whenever I reach an X on the trip column, in order to obtain something like this(note that there's a new column):

  id from to trip group
1  1    A  B <NA>  1
2  1    B  C    X  1
3  1    C  D <NA>  2
4  1    D  A    X  2
5  2    B  A    X  3
6  2    A  C <NA>  4
7  2    C  D <NA>  4

What this means is that it goes through the observations/rows, and based on the id and trip column, it starts a new number every time it reaches an X.

I've tried a couple of things with group_by(df, id, trip), but I always end up without a similar structure to that one.

Any suggestions?

Upvotes: 1

Views: 143

Answers (2)

lmo
lmo

Reputation: 38500

This base R method works for the example:

df$group <- c(1, head(cumsum(df$trip == "X") + 1, -1))

In the example, it returns

df
  id from to trip group
1  1    A  B <NA>     1
2  1    B  C    X     1
3  1    C  D <NA>     2
4  1    D  A    X     2
5  2    B  A    X     3
6  2    A  C <NA>     4
7  2    C  D <NA>     4

Also, in the case that the first observation is "X", then 1 will be the first element and 2 will be the second element, which I believe is what is desired.

As @manotheshark mentions in the commments, this solution will not work on NA values, because cumsum will return NA after the first NA is encountered. The suggested solution would be to replace df$trip == "X" with is.na(df$trip) which will return the correct values in the presence of NAs.

Upvotes: 2

Alexvonrass
Alexvonrass

Reputation: 330

Try this (if you actually need to start new group on row after X i think you can achieve that with another shift() on the trip == "X":

library(data.table)
set.seed(1)

na.zero <- function (x) {
  x[is.na(x)] <- 0
  return(x)
}

dt <- data.table(id = c(1,1,1,2,2,2),
                 from = sample(c("A", "B", "C", "D"), 6, replace = T),
                 to = sample(c("A", "B", "C", "D"), 6, replace = T),
                 trip = sample(c("X", NA), 6, replace = T))

dt[, group:=(cumsum(na.zero(trip=="X"))+cumsum(id-na.zero(shift(id, 1L,type = 'lag'))))]

   id from to trip group
1:  1    B  D   NA     1
2:  1    B  C    X     2
3:  1    C  C   NA     2
4:  2    D  A    X     4
5:  2    A  A   NA     4
6:  2    D  A   NA     4

Upvotes: 0

Related Questions