Reputation: 69241

Reshape data based on column in dataframe

I need to take a data.frame in the format of:

  id1 id2 mean start end
1   A   D    4    12  15
2   B   E    5    14  15
3   C   F    6     8  10

and generate duplicate rows based on the difference in start - end. For example, I need 3 rows for the first row, 1 for the second, and 2 for the third. The start and end fields should be in sequential order in the final data.frame. The end result for this data.frame should be:

   id1 id2 mean start end
1    A   D    4    12  13
2    A   D    4    13  14
3    A   D    4    14  15
21   B   E    5    14  15
31   C   F    6     8   9
32   C   F    6     9  10

I have written this function which works, but isn't written in very R'esque code:

dupData <- function(df){
    diff <- abs(df$start - df$end)
    ret <- {}

    #Expand our dataframe into the appropriate number of rows.
    for (i in 1:nrow(df)){
        for (j in 1:diff[i]){
            ret <- rbind(ret, df[i,])
        } 
    }

    #If matching ID1 and ID2, generate a sequential ordering of start & end dates
    for (k in 2:nrow(ret) - 1) {
        if ( ret[k,1] == ret[k + 1, 1] & ret[k, 2] == ret[k, 2]  ){ 
            ret[k, 5] <- ret[k, 4] + 1
            ret[k + 1, 4] <- ret[k, 5]  
        }
    }
    return(ret)
}

Does anyone have suggestions on how to optimize this code? Is there a function in plyr which may be applicable?

#sample daters
df <- data.frame(id1 = c("A", "B", "C")
        , id2 = c("D", "E", "F")
        , mean = c(4,5,6)  
        , start = c(12,14,8)
        , end = c(15, 15, 10)
)

Upvotes: 2

Answers (4)

A5C1D2H2I1M1N2O1R2T1

Reputation: 193687

Two alternatives, many years later, offering alternatives using today's popular data.table and tidyverse packages:

Option 1:

library(data.table)
setDT(mydf)[, list(mean, start = start:(end-1)), .(id1, id2)][, end := start + 1][]
   id1 id2 mean start end
1:   A   D    4    12  13
2:   A   D    4    13  14
3:   A   D    4    14  15
4:   B   E    5    14  15
5:   C   F    6     8   9
6:   C   F    6     9  10

Option 2:

library(tidyverse)
mydf %>% 
  group_by(id1, id2, mean) %>% 
  summarise(start = list(start:(end-1))) %>% 
  unnest(start) %>% 
  mutate(end = start+1)

Upvotes: 0

rakso-dog

Reputation: 11

No doubt this isn't one of those times where late is better than never, but i had a similar issue and came up with this...

library(plyr)
ddply(df, c("id1", "id2", "mean", "start", "end"), summarise,
                    sq=seq(1:(end-start)))

Upvotes: 1

apeescape

Reputation: 1099

There's probably a more general way to do this, but below uses rbind.fill.

cbind(df[rep(1:nrow(df), times = apply(df[,4:5], 1, diff)), 1:3],
      rbind.fill(apply(df[,4:5], 1, function(x)
                       data.frame(start = x[1]:(x[2]-1), end = (x[1]+1):x[2]))))


##     id1 id2 mean start end
## 1     A   D    4    12  13
## 1.1   A   D    4    13  14
## 1.2   A   D    4    14  15
## 2     B   E    5    14  15
## 3     C   F    6     8   9
## 3.1   C   F    6     9  10

Upvotes: 2

Aniko

Reputation: 18884

The survSplit function of the survival package does something along these lines, though it has a bit more options (eg specifying the cut times). You might be able to use it, or look at its code to see if you can implement your simplified version better.

Upvotes: 1

Reshape data based on column in dataframe

Answers (4)

Related Questions