ECII

Reputation: 10629

Transform a data.frame, while filling missing values

I have the data frame

data<-data.frame(id=c("A","A","B","B"), day=c(5,6,1,2), duration=c(12,1440,5,6), obs.period=c(60, 60,100,100))

It shows the subject ID, the day of the event, the duration of the event, and the subject's observation period.

I want to transform the data set so that it shows the whole observation period for each subject (all days of observation), with zero as the duration for days on which no event was observed.

For the above dataset this would be something like this:

id  day duration    obs.period
A   1   0   60
A   2   0   60
A   3   0   60
A   4   0   60
A   5   12  60
A   6   1440    60
A   7   0   60
A   8   0   60
    .       
    .       
    .       
A   60  0   60
B   1   5   100
B   2   6   100
B   3   0   100
B   4   0   100
    .       
    .       
    .       
    .       
B   100 0   100

Any ideas?

Upvotes: 2

Views: 618

Answers (4)

edgester

Reputation: 503

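Create a skeleton data frame with the index columns (one row per subject and day, up to that subject's observation period) but no duration column, then merge it with your data and replace the NAs in duration with zeros.

data<-data.frame(id=c("A","A","B","B"), day=c(5,6,1,2), duration=c(12,1440,5,6), obs.period=c(60, 60,100,100))
# one row per subject with its observation period
periods=unique(data[,c("id","obs.period")])
# skeleton: days 1..obs.period for each subject (60 rows for A, 100 for B)
zilch=data.frame(id=rep(periods$id,periods$obs.period),day=sequence(periods$obs.period),obs.period=rep(periods$obs.period,periods$obs.period))
# keep every skeleton row; days without an event get NA in duration
all=merge(zilch,data, all.x=TRUE)
all[is.na(all$duration),"duration"]<-0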

Upvotes: 2

Chase

Reputation: 69221

Here's one approach using the plyr package. First, create a function to expand the data into the appropriate number of rows. Then, index into that new data.frame with the duration info from the original data. Finally, call this function with ddply() and group on the id variable.

require(plyr)
FUN <- function(x){
  dat <- data.frame(
    id = x[1,1]
    , day = seq_len(x[1,4])
    , duration = 0
    , obs.period = x[1,4]
    )

  # match() finds the row of each observed day; comparing with == would depend on vector recycling
  dat$duration[match(x$day, dat$day)] <- x$duration
  return(dat)
}


ddply(data, "id", FUN)

    id day duration obs.period
1    A   1        0         60
2    A   2        0         60
3    A   3        0         60
4    A   4        0         60
5    A   5       12         60
6    A   6     1440         60
...
61   B   1        5        100
62   B   2        6        100
63   B   3        0        100
...
160  B 100        0        100

Upvotes: 3

Karsten W.

Reputation: 18500

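Here is an approach with plyr:

library(plyr)
fill1 <- function(df) {
  # Days 1..obs.period for this subject (a fixed 1:100 would overfill subject A)
  full_period <- seq_len(df[1, "obs.period"])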
  to_fill <- setdiff(full_period, df$day)
  fill_id <- df[1,"id"]
  fill_dur <- 0
  fill_obs.p <- df[1,"obs.period"]
  rows_to_add <- data.frame(id=fill_id, day=to_fill, duration=fill_dur, obs.period=fill_obs.p)
  rbind(df,rows_to_add)
}
ddply(data, "id", fill1)

The result is not sorted by day within each id, however.
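
One way to restore the order is to sort on id and day with base R's order(). The sketch below uses a small made-up data frame, `unsorted`, as a stand-in for the ddply() result (original rows first, filler rows after); the names `unsorted` and `sorted` are only for illustration.

```r
# Hypothetical stand-in for the output of ddply(data, "id", fill1):
# within each id, the observed rows come before the zero-filled rows.
unsorted <- data.frame(
  id         = c("A", "A", "A", "B", "B", "B"),
  day        = c(2, 3, 1, 1, 3, 2),
  duration   = c(12, 0, 0, 5, 0, 6),
  obs.period = 3
)

# Sort by id first, then by day within each id.
sorted <- unsorted[order(unsorted$id, unsorted$day), ]

# Drop the old row names so the rows are numbered 1..n again.
rownames(sorted) <- NULL
```

The same order() call should work on the real result as long as it has id and day columns.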

Upvotes: 1

Karl

Reputation: 2019

I would first create a data frame to contain the results.

ob.period <- with(data, tapply(obs.period, id, max))

n <- sum(ob.period)
result <- data.frame(id=rep(names(ob.period), ob.period),
                     day=unlist(lapply(ob.period, function(a) 1:a)),
                     duration=rep(0, n),
                     obs.period=rep(ob.period,ob.period))

Then I would paste id and day together, use match to find the relevant rows in the larger data frame, and plug in the duration values.

idday.sm <- paste(data$id, data$day, sep=":")
idday.lg <- paste(result$id, result$day, sep=":")

result$duration[match(idday.sm, idday.lg)] <- data$duration
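
To see what the paste-and-match step does, here is a minimal sketch on a toy example. The data frames `small` and `large` and the variable `pos` are made up for illustration and are not the question's data; the sketch also assumes day holds whole numbers, so that paste() formats it the same way in both frames.

```r
# Toy data: two observed events, and a zero-filled frame with 3 days per subject.
small <- data.frame(id = c("A", "B"), day = c(2, 1), duration = c(12, 5))
large <- data.frame(id = rep(c("A", "B"), each = 3),
                    day = rep(1:3, times = 2),
                    duration = 0)

# Build one combined key per row, e.g. "A:2".
key.sm <- paste(small$id, small$day, sep = ":")
key.lg <- paste(large$id, large$day, sep = ":")

# For each row of `small`, the position of the matching row in `large`.
pos <- match(key.sm, key.lg)

# Overwrite the zeros at exactly those positions.
large$duration[pos] <- small$duration
```

If a key from the smaller frame were missing from the larger one, match() would return NA for it and the assignment would fail, so it may be worth checking anyNA(pos) first.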

Upvotes: 1
