Stefano Lombardi
Stefano Lombardi

Reputation: 1591

Reshaping from wide to long and vice versa (multistate/survival analysis dataset)

I am trying to reshape the following dataset with reshape(), without much results.

The starting dataset is in "wide" form, with each id described through one row. The dataset is intended to be adopted for carry out Multistate analyses (a generalization of Survival Analysis).

Each person is recorded for a given overall time span. During this period the subject can experience a number of transitions among states (for simplicity let us fix to two the maximum number of distinct states that can be visited). The first visited state is s1 = 1, 2, 3, 4. The person stays within the state for dur1 time periods, and the same applies for the second visited state s2:

   id    cohort    s1     dur1     s2     dur2     
     1      1        3      4       2      5       
     2      0        1      4       4      3    

The dataset in long format which I woud like to obtain is:

id    cohort    s    
1       1       3
1       1       3
1       1       3
1       1       3
1       1       2
1       1       2
1       1       2
1       1       2
1       1       2
2       0       1
2       0       1
2       0       1
2       0       1
2       0       4
2       0       4
2       0       4

In practice, each id has dur1 + dur2 rows, and s1 and s2 are melted in a single variable s.

How would you do this transformation? Also, how would you cmoe back to the original dataset "wide" form?

Many thanks!

dat <- cbind(id=c(1,2), cohort=c(1, 0), s1=c(3, 1), dur1=c(4, 4), s2=c(2, 4), dur2=c(5, 3))

Upvotes: 0

Views: 650

Answers (2)

A5C1D2H2I1M1N2O1R2T1
A5C1D2H2I1M1N2O1R2T1

Reputation: 193517

You can use reshape() for the first step, but then you need to do some more work. Also, reshape() needs a data.frame() as its input, but your sample data is a matrix.

Here's how to proceed:

  1. reshape() your data from wide to long:

    dat2 <- reshape(data.frame(dat), direction = "long", 
                    idvar = c("id", "cohort"),
                    varying = 3:ncol(dat), sep = "")
    dat2
    #       id cohort time s dur
    # 1.1.1  1      1    1 3   4
    # 2.0.1  2      0    1 1   4
    # 1.1.2  1      1    2 2   5
    # 2.0.2  2      0    2 4   3
    
  2. "Expand" the resulting data.frame using rep()

    dat3 <- dat2[rep(seq_len(nrow(dat2)), dat2$dur), c("id", "cohort", "s")]
    dat3[order(dat3$id), ]
    #         id cohort s
    # 1.1.1    1      1 3
    # 1.1.1.1  1      1 3
    # 1.1.1.2  1      1 3
    # 1.1.1.3  1      1 3
    # 1.1.2    1      1 2
    # 1.1.2.1  1      1 2
    # 1.1.2.2  1      1 2
    # 1.1.2.3  1      1 2
    # 1.1.2.4  1      1 2
    # 2.0.1    2      0 1
    # 2.0.1.1  2      0 1
    # 2.0.1.2  2      0 1
    # 2.0.1.3  2      0 1
    # 2.0.2    2      0 4
    # 2.0.2.1  2      0 4
    # 2.0.2.2  2      0 4 
    

You can get rid of the funky row names too by using rownames(dat3) <- NULL.

Update: Retaining the ability to revert to the original form

In the example above, since we dropped the "time" and "dur" variables, it isn't possible to directly revert to the original dataset. If you feel this is something you would need to do, I suggest keeping those columns in and creating another data.frame with the subset of the columns that you need if required.

Here's how:

Use aggregate() to get back to "dat2":

aggregate(cbind(s, dur) ~ ., dat3, unique)
#   id cohort time s dur
# 1  2      0    1 1   4
# 2  1      1    1 3   4
# 3  2      0    2 4   3
# 4  1      1    2 2   5

Wrap reshape() around that to get back to "dat1". Here, in one step:

reshape(aggregate(cbind(s, dur) ~ ., dat3, unique), 
        direction = "wide", idvar = c("id", "cohort"))
#   id cohort s.1 dur.1 s.2 dur.2
# 1  2      0   1     4   4     3
# 2  1      1   3     4   2     5

Upvotes: 3

Mark Miller
Mark Miller

Reputation: 13103

There are probably better ways, but this might work.

df <- read.table(text = '
   id    cohort    s1     dur1     s2     dur2     
     1      1        3      4       2      5       
     2      0        1      4       4      3',
header=TRUE)

hist <- matrix(0, nrow=2, ncol=9)
hist

for(i in 1:nrow(df)) {

hist[i,] <- c(rep(df[i,3], df[i,4]), rep(df[i,5], df[i,6]), rep(0, (9 - df[i,4] - df[i,6])))

}

hist

hist2 <- cbind(df[,1:2], hist)
colnames(hist2) <- c('id', 'cohort', paste('x', seq_along(1:9), sep=''))

library(reshape2)

hist3 <- melt(hist2, id.vars=c('id', 'cohort'), variable.name='x', value.name='state')

hist4 <- hist3[order(hist3$id, hist3$cohort),]
hist4

hist4 <- hist4[ , !names(hist4) %in% c("x")]

hist4 <- hist4[!(hist4[,2]==0 & hist4[,3]==0),]

Gives:

   id cohort state
1   1      1     3
3   1      1     3
5   1      1     3
7   1      1     3
9   1      1     2
11  1      1     2
13  1      1     2
15  1      1     2
17  1      1     2
2   2      0     1
4   2      0     1
6   2      0     1
8   2      0     1
10  2      0     4
12  2      0     4
14  2      0     4

Of course, if you have more than two states per id then this would have to be modified (and it might have to be modified if you have more than two cohorts). For example, I suppose with 9 sample periods one person could be in the following sequence of states:

1 3 2 4 3 4 1 1 2

Upvotes: 1

Related Questions