user1231088

Reputation: 357

expand data.frame to long format and increment value

I would like to convert my data from a short format to a long format and I imagine there is a simple way to do it (possibly with reshape2, plyr, dplyr, etc?).

For example, I have:

foo <- data.frame(id = 1:5, 
              y = c(0, 1, 0, 1, 0),
              time = c(2, 3, 4, 2, 3))

id y time
1  0  2
2  1  3
3  0  4
4  1  2
5  0  3

I would like to expand/copy each row n times, where n is that row's value in the "time" column. However, I would also like the variable "time" to be incremented from 1 to n. That is, I would like to produce:

id  y time
1   0   1
1   0   2
2   1   1
2   1   2
2   1   3
3   0   1
3   0   2
3   0   3
3   0   4
4   1   1
4   1   2
5   0   1
5   0   2
5   0   3

As a bonus, I would also like to adjust the variable "y": for those ids with y = 1, y should be 0 on every row except the last one, i.e. the row with the largest value of "time". That is, I would like to produce:

id  y time
1   0   1
1   0   2
2   0   1
2   0   2
2   1   3
3   0   1
3   0   2
3   0   3
3   0   4
4   0   1
4   1   2
5   0   1
5   0   2
5   0   3

This seems like something that dplyr might already do, but I just don't know where to look. Regardless, any solution that avoids loops is helpful.

Upvotes: 3

Views: 522

Answers (4)

A5C1D2H2I1M1N2O1R2T1

Reputation: 193687

If you're willing to go with "data.table", you can try:

library(data.table)
fooDT <- as.data.table(foo)
fooDT[, list(time = sequence(time)), by = list(id, y)]
#     id y time
#  1:  1 0    1
#  2:  1 0    2
#  3:  2 1    1
#  4:  2 1    2
#  5:  2 1    3
#  6:  3 0    1
#  7:  3 0    2
#  8:  3 0    3
#  9:  3 0    4
# 10:  4 1    1
# 11:  4 1    2
# 12:  5 0    1
# 13:  5 0    2
# 14:  5 0    3
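
As a small aside on why this works: base R's sequence() builds 1..n for each count it is given and concatenates the results. A minimal sketch, using the "time" values from the question (the names counts and idx are just for illustration):

```r
# The "time" column from foo
counts <- c(2, 3, 4, 2, 3)

# sequence() concatenates seq_len(n) for each n in counts
idx <- sequence(counts)
print(idx)
# 1 2 1 2 3 1 2 3 4 1 2 1 2 3

# Same result built by hand, to show what is happening
by_hand <- unlist(lapply(counts, seq_len))
print(all(idx == by_hand))
```

Grouping by id (and y) means sequence() only ever sees one count per group, so each id gets its own 1..time run.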

And, for the bonus question:

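# Expand as before, then zero y on every row of each id except the last;
# seq_len(.N - 1) stays empty when an id has only one row, unlike 1:(.N-1)
fooDT[, list(time = sequence(time)), 
      by = list(id, y)][, y := {y[seq_len(.N - 1)] <- 0; y}, 
                        by = id][]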
#     id y time
#  1:  1 0    1
#  2:  1 0    2
#  3:  2 0    1
#  4:  2 0    2
#  5:  2 1    3
#  6:  3 0    1
#  7:  3 0    2
#  8:  3 0    3
#  9:  3 0    4
# 10:  4 0    1
# 11:  4 1    2
# 12:  5 0    1
# 13:  5 0    2
# 14:  5 0    3

For the bonus question, alternatively:

fooDT[, list(time=seq_len(time)), by=list(id,y)][y == 1, 
                y := c(rep.int(0, .N-1L), 1), by=id][]

Upvotes: 3

thelatemail

Reputation: 93938

The initial expansion can be achieved with:

newdat <- transform( 
  foo[rep(rownames(foo),foo$time),], 
  time = sequence(foo$time)
)

#    id y time
#1    1 0    1
#1.1  1 0    2
#2    2 1    1
#2.1  2 1    2
#2.2  2 1    3
# etc

To get the complete solution, including the bonus part, then do:

newdat$y[-cumsum(foo$time)] <- 0

#    id y time
#1    1 0    1
#1.1  1 0    2
#2    2 0    1
#2.1  2 0    2
#2.2  2 1    3
#etc
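
In case the negative index looks like magic: cumsum(foo$time) gives the position of the last expanded row for each id, so dropping those positions leaves exactly the rows whose y should become 0. A self-contained sketch of that reasoning (last_rows is an illustrative name, and the expansion here uses seq_len(nrow(foo)) rather than rownames, which should be equivalent for this data):

```r
foo <- data.frame(id = 1:5,
                  y = c(0, 1, 0, 1, 0),
                  time = c(2, 3, 4, 2, 3))

# Position of the final row of each id in the expanded data
last_rows <- cumsum(foo$time)
print(last_rows)
# 2 5 9 11 14

# Expand, renumber time, then zero y everywhere except the last rows
newdat <- foo[rep(seq_len(nrow(foo)), foo$time), ]
newdat$time <- sequence(foo$time)
newdat$y[-last_rows] <- 0
print(newdat)
```

This relies on the expanded rows staying in the same order as foo, which holds here because rep() repeats the rows in place.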

If you were really excitable, you could do it all in one step using within:

within(
  foo[rep(rownames(foo),foo$time),],
  {
    time <- sequence(foo$time)
    y[-cumsum(foo$time)] <- 0
  }
)

Upvotes: 3

Athos

Reputation: 660

With dplyr (and magrittr for nice legibility):

library(magrittr)
library(dplyr)

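# Repeat each row time-many times, then renumber time within each id
# and keep the original y only on the last row of each id
foo[rep(seq_len(nrow(foo)), foo$time), ] %>%
    group_by(id) %>%
    mutate(time = seq_len(n()),
           y = ifelse(time == n(), y, 0))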

Hope it helps

Upvotes: 0

Matthew Lundberg

Reputation: 42689

You can create a new data frame with the proper id and time columns for the long format, then merge that with the original. This leaves NA for the unmatched values, which can then be substituted with 0:

merge(foo, 
      with(foo, 
           data.frame(id=rep(id,time), time=sequence(time))
      ), 
      all.y=TRUE
)
##    id time  y
## 1   1    1 NA
## 2   1    2  0
## 3   2    1 NA
## 4   2    2 NA
## 5   2    3  1
## 6   3    1 NA
## 7   3    2 NA
## 8   3    3 NA
## 9   3    4  0
## 10  4    1 NA
## 11  4    2  1
## 12  5    1 NA
## 13  5    2 NA
## 14  5    3  0
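
The substitution of NA with 0 mentioned above could be done along these lines. This is only a sketch; the names grid and long are introduced here for illustration:

```r
foo <- data.frame(id = 1:5,
                  y = c(0, 1, 0, 1, 0),
                  time = c(2, 3, 4, 2, 3))

# Every (id, time) combination wanted in the long format
grid <- with(foo, data.frame(id = rep(id, time), time = sequence(time)))

# Keep all rows of grid; y is NA wherever foo has no matching (id, time)
long <- merge(foo, grid, all.y = TRUE)

# Replace the unmatched values with 0
long$y[is.na(long$y)] <- 0
print(long)
```

Only the last row of each id matches a row of foo, so that is the only place the original y survives.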

A similar merge works for the first expansion. Merge foo without the time column with the same created data frame as above:

merge(foo[c('id','y')], 
      with(foo, 
           data.frame(id=rep(id,time), time=sequence(time))
      )
) 
##    id y time
## 1   1 0    1
## 2   1 0    2
## 3   2 1    1
## 4   2 1    2
## 5   2 1    3
## 6   3 0    1
## 7   3 0    2
## 8   3 0    3
## 9   3 0    4
## 10  4 1    1
## 11  4 1    2
## 12  5 0    1
## 13  5 0    2
## 14  5 0    3

It's not necessary to specify all (or all.y) in the latter expression because there are multiple time values for each matching id value, and these are expanded. In the prior case, the time values were matched from both data frames, and without specifying all (or all.y) you would get your original data back.
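
To see the difference described above, you can compare the row counts of the two merges. A quick sketch (grid, inner and expanded are illustrative names):

```r
foo <- data.frame(id = 1:5,
                  y = c(0, 1, 0, 1, 0),
                  time = c(2, 3, 4, 2, 3))

grid <- with(foo, data.frame(id = rep(id, time), time = sequence(time)))

# Merging on both id and time without all.y keeps only the exact matches,
# i.e. one row per id: the original data
inner <- merge(foo, grid)
print(nrow(inner))
# 5

# Merging on id alone repeats each id's y across all of its time values
expanded <- merge(foo[c("id", "y")], grid)
print(nrow(expanded))
# 14
```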

Upvotes: 3
