Reputation: 69151
I need to take a data.frame in the format of:
id1 id2 mean start end
1 A D 4 12 15
2 B E 5 14 15
3 C F 6 8 10
and generate duplicate rows based on the difference in start - end
. For example, I need 3 rows for the first row, 1 for the second, and 2 for the third. The start and end fields should be in sequential order in the final data.frame. The end result for this data.frame should be:
id1 id2 mean start end
1 A D 4 12 13
2 A D 4 13 14
3 A D 4 14 15
21 B E 5 14 15
31 C F 6 8 9
32 C F 6 9 10
I have written this function which works, but isn't written in very R'esque code:
dupData <- function(df){
diff <- abs(df$start - df$end)
ret <- {}
#Expand our dataframe into the appropriate number of rows.
for (i in 1:nrow(df)){
for (j in 1:diff[i]){
ret <- rbind(ret, df[i,])
}
}
#If matching ID1 and ID2, generate a sequential ordering of start & end dates
for (k in 2:nrow(ret) - 1) {
if ( ret[k,1] == ret[k + 1, 1] & ret[k, 2] == ret[k, 2] ){
ret[k, 5] <- ret[k, 4] + 1
ret[k + 1, 4] <- ret[k, 5]
}
}
return(ret)
}
Does anyone have suggestions on how to optimize this code? Is there a function in plyr
which may be applicable?
#sample daters
df <- data.frame(id1 = c("A", "B", "C")
, id2 = c("D", "E", "F")
, mean = c(4,5,6)
, start = c(12,14,8)
, end = c(15, 15, 10)
)
Upvotes: 2
Views: 497
Reputation: 193497
Two alternatives, many years later, offering alternatives using today's popular data.table
and tidyverse
packages:
Option 1:
library(data.table)
setDT(mydf)[, list(mean, start = start:(end-1)), .(id1, id2)][, end := start + 1][]
id1 id2 mean start end
1: A D 4 12 13
2: A D 4 13 14
3: A D 4 14 15
4: B E 5 14 15
5: C F 6 8 9
6: C F 6 9 10
Option 2:
library(tidyverse)
mydf %>%
group_by(id1, id2, mean) %>%
summarise(start = list(start:(end-1))) %>%
unnest(start) %>%
mutate(end = start+1)
Upvotes: 0
Reputation: 11
No doubt this isn't one of those times where late is better than never, but i had a similar issue and came up with this...
library(plyr)
ddply(df, c("id1", "id2", "mean", "start", "end"), summarise,
sq=seq(1:(end-start)))
Upvotes: 1
Reputation: 1099
There's probably a more general way to do this, but below uses rbind.fill
.
cbind(df[rep(1:nrow(df), times = apply(df[,4:5], 1, diff)), 1:3],
rbind.fill(apply(df[,4:5], 1, function(x)
data.frame(start = x[1]:(x[2]-1), end = (x[1]+1):x[2]))))
## id1 id2 mean start end
## 1 A D 4 12 13
## 1.1 A D 4 13 14
## 1.2 A D 4 14 15
## 2 B E 5 14 15
## 3 C F 6 8 9
## 3.1 C F 6 9 10
Upvotes: 2
Reputation: 18864
The survSplit
function of the survival
package does something along these lines, though it has a bit more options (eg specifying the cut times). You might be able to use it, or look at its code to see if you can implement your simplified version better.
Upvotes: 1