Reputation: 357
I would like to convert my data from a short format to a long format and I imagine there is a simple way to do it (possibly with reshape2, plyr, dplyr, etc?).
For example, I have:
foo <- data.frame(id = 1:5,
y = c(0, 1, 0, 1, 0),
time = c(2, 3, 4, 2, 3))
id y time
1 0 2
2 1 3
3 0 4
4 1 2
5 0 3
I would like to expand/copy each row n times, where n is that row's value in the "time" column. However, I would also like the variable "time" to be incremented from 1 to n. That is, I would like to produce:
id y time
1 0 1
1 0 2
2 1 1
2 1 2
2 1 3
3 0 1
3 0 2
3 0 3
3 0 4
4 1 1
4 1 2
5 0 1
5 0 2
5 0 3
As a bonus, I would also like to do a sort of incrementing of the variable "y" where, for those ids with y = 1, y is set to 0 until the largest value of "time". That is, I would like to produce:
id y time
1 0 1
1 0 2
2 0 1
2 0 2
2 1 3
3 0 1
3 0 2
3 0 3
3 0 4
4 0 1
4 1 2
5 0 1
5 0 2
5 0 3
This seems like something that dplyr might already do, but I just don't know where to look. Regardless, any solution that avoids loops is helpful.
Upvotes: 3
Views: 522
Reputation: 193687
If you're willing to go with "data.table", you can try:
library(data.table)
fooDT <- as.data.table(foo)
fooDT[, list(time = sequence(time)), by = list(id, y)]
# id y time
# 1: 1 0 1
# 2: 1 0 2
# 3: 2 1 1
# 4: 2 1 2
# 5: 2 1 3
# 6: 3 0 1
# 7: 3 0 2
# 8: 3 0 3
# 9: 3 0 4
# 10: 4 1 1
# 11: 4 1 2
# 12: 5 0 1
# 13: 5 0 2
# 14: 5 0 3
And, for the bonus question:
fooDT[, list(time = sequence(time)),
by = list(id, y)][, y := {y[1:(.N-1)] <- 0; y},
by = id][]
# id y time
# 1: 1 0 1
# 2: 1 0 2
# 3: 2 0 1
# 4: 2 0 2
# 5: 2 1 3
# 6: 3 0 1
# 7: 3 0 2
# 8: 3 0 3
# 9: 3 0 4
# 10: 4 0 1
# 11: 4 1 2
# 12: 5 0 1
# 13: 5 0 2
# 14: 5 0 3
For the bonus question, alternatively:
fooDT[, list(time=seq_len(time)), by=list(id,y)][y == 1,
y := c(rep.int(0, .N-1L), 1), by=id][]
Upvotes: 3
Reputation: 93938
The initial expansion can be achieved with:
newdat <- transform(
foo[rep(rownames(foo),foo$time),],
time = sequence(foo$time)
)
# id y time
#1 1 0 1
#1.1 1 0 2
#2 2 1 1
#2.1 2 1 2
#2.2 2 1 3
# etc
To get the complete solution, including the bonus part, then do:
newdat$y[-cumsum(foo$time)] <- 0
# id y time
#1 1 0 1
#1.1 1 0 2
#2 2 0 1
#2.1 2 0 2
#2.2 2 1 3
#etc
If you were really excitable, you could do it all in one step using within
:
within(
foo[rep(rownames(foo),foo$time),],
{
time <- sequence(foo$time)
y[-cumsum(foo$time)] <- 0
}
)
Upvotes: 3
Reputation: 660
With dplyr (and magritte for nice legibility):
library(magrittr)
library(dplyr)
foo[rep(1:nrow(foo), foo$time), ] %>%
group_by(id) %>%
mutate(y = !duplicated(y, fromLast = TRUE),
time = 1:n())
Hope it helps
Upvotes: 0
Reputation: 42689
You can create a new data frame with the proper id
and time
columns for the long format, then merge that with the original. This leaves NA
for the unmatched values, which can then be substituted with 0
:
merge(foo,
with(foo,
data.frame(id=rep(id,time), time=sequence(time))
),
all.y=TRUE
)
## id time y
## 1 1 1 NA
## 2 1 2 0
## 3 2 1 NA
## 4 2 2 NA
## 5 2 3 1
## 6 3 1 NA
## 7 3 2 NA
## 8 3 3 NA
## 9 3 4 0
## 10 4 1 NA
## 11 4 2 1
## 12 5 1 NA
## 13 5 2 NA
## 14 5 3 0
A similar merge works for the first expansion. Merge foo
without the time
column with the same created data frame as above:
merge(foo[c('id','y')],
with(foo,
data.frame(id=rep(id,time), time=sequence(time))
)
)
## id y time
## 1 1 0 1
## 2 1 0 2
## 3 2 1 1
## 4 2 1 2
## 5 2 1 3
## 6 3 0 1
## 7 3 0 2
## 8 3 0 3
## 9 3 0 4
## 10 4 1 1
## 11 4 1 2
## 12 5 0 1
## 13 5 0 2
## 14 5 0 3
It's not necessary to specify all
(or all.y
) in the latter expression because there are multiple time
values for each matching id
value, and these are expanded. In the prior case, the time
values were matched from both data frames, and without specifying all
(or all.y
) you would get your original data back.
Upvotes: 3