Alastair

Reputation: 1773

Using apply and rbind to build an R data.frame

I have an existing data.frame that contains some initial values. I want to create another data.frame that has 10 randomly sampled rows for every row in the first data.frame. I'd also like to do this in idiomatic R, so I want to avoid explicit iteration.

So far I've managed to apply a function to every row in the table, generating one value per row. However, I'm not sure how to extend this to generate 10 rows per application and then rbind the results back together.

Here's my progress so far:

Sample data:

   starts <- structure(list(instance = structure(21:26, .Label = c("big_1", 
   "big_10", "big_11", "big_12", "big_13", "big_14", "big_15", "big_16", 
   "big_17", "big_18", "big_19", "big_2", "big_20", "big_3", "big_4", 
   "big_5", "big_6", "big_7", "big_8", "big_9", "competition01", 
   "competition02", "competition03", "competition04", "competition05", 
   "competition06", "competition07", "competition08", "competition09", 
   "competition10", "competition11", "competition12", "competition13", 
   "competition14", "competition15", "competition16", "competition17", 
   "competition18", "competition19", "competition20", "med_1", "med_10", 
   "med_11", "med_12", "med_13", "med_14", "med_15", "med_16", "med_17", 
   "med_18", "med_19", "med_2", "med_20", "med_3", "med_4", "med_5", 
   "med_6", "med_7", "med_8", "med_9", "small_1", "small_10", "small_11", 
   "small_12", "small_13", "small_14", "small_15", "small_16", "small_17", 
   "small_18", "small_19", "small_2", "small_20", "small_3", "small_4", 
   "small_5", "small_6", "small_7", "small_8", "small_9"), class = "factor"), 
   event.clashes = c(674L, 626L, 604L, 1036L, 991L, 929L), overlaps = c(0L, 
   0L, 0L, 0L, 0L, 0L), room.valid = c(324L, 320L, 268L, 299L, 
   294L, 220L), final.timeslot = c(0L, 0L, 0L, 0L, 0L, 0L), 
   three.in.a.row = c(246L, 253L, 259L, 389L, 365L, 430L), single.event = c(97L, 
   120L, 97L, 191L, 150L, 138L)), .Names = c("instance", "event.clashes", 
   "overlaps", "room.valid", "final.timeslot", "three.in.a.row", 
   "single.event"), row.names = c(NA, 6L), class = "data.frame")

Code:

   library(reshape)
   m.starts <- melt(starts)

   df <- data.frame()

   gen.data <- function(x){
       inst <- x[1]
       constr <- x[2]
       v <- as.integer(x[3])
       val <- as.integer(rnorm(1, max(0, v), v / 2))
       # Should probably return a data.frame here
       print(paste(inst, constr, val))
   }

   apply(m.starts, 1, gen.data)

Upvotes: 2

Views: 14741

Answers (3)

joran

Reputation: 173547

You can combine the ideas from Andrie's and Chase's solutions as follows:

# m.starts is the melted data frame from the question
# Repeat each row ten times
start.m1 <- m.starts[rep(1:nrow(m.starts), each = 10), ]

# Create an extended vector used to define
# the means and standard deviations
m <- rep(m.starts$value, each = 10)

# Clamp negative values to zero,
# although there were none in your data
m[m <= 0] <- 0

# Replace value with rnorm draws
start.m1$value <- rnorm(nrow(start.m1), mean = m, sd = m / 2)

which yields something that looks like this:

> head(start.m1)
         instance      variable     value
1   competition01 event.clashes 1098.0220
1.1 competition01 event.clashes 1208.4304
1.2 competition01 event.clashes  883.7976
1.3 competition01 event.clashes  365.1396
1.4 competition01 event.clashes  862.3113
1.5 competition01 event.clashes 1352.7085

I'm using Andrie's suggestion of subset indexing to extend the data frame, together with Chase's interpretation of your question: you seem to want the values to be generated by rnorm, rather than by resampling the original rows themselves. The key here is that rnorm is vectorized, so a single call can draw every value, each with its own mean and standard deviation.
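As a minimal sketch of that vectorization, assuming a small toy data frame (the names toy, expanded, and mu are made up for illustration and do not come from the question), the following shows rnorm pairing each draw with its own mean and sd:

```r
set.seed(42)  # only so the sketch is reproducible

# Hypothetical stand-in for the melted data
toy <- data.frame(id = c("a", "b", "c"), value = c(100, 200, 300))

# Repeat each row 4 times via row indexing
expanded <- toy[rep(seq_len(nrow(toy)), each = 4), ]

# One mean per output row; rnorm uses mean[i] and sd[i] for draw i
mu <- pmax(expanded$value, 0)
expanded$value <- rnorm(nrow(expanded), mean = mu, sd = mu / 2)

nrow(expanded)      # 12
table(expanded$id)  # 4 rows for each id
```

Because the means are expanded with the same rep(..., each = n) pattern as the rows, each draw lines up with the row it belongs to.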

Upvotes: 0

Andrie

Reputation: 179418

There is no need for apply or rbind. Simple vector subsetting is all that is required: sample row indices with replacement, then use them to index the data frame.

samples <- sample(1:nrow(starts), nrow(starts)*10, replace=TRUE)
starts[samples, 1:3]

The first 5 rows of results:

> head(starts[samples, 1:3], 5)

         instance event.clashes overlaps
2   competition02           626        0
5   competition05           991        0
6   competition06           929        0
4   competition04          1036        0
2.1 competition02           626        0
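The row name 2.1 in the output above comes from R making row names unique when the same row is selected more than once. A minimal sketch on a hypothetical toy data frame (toy, idx, and picked are illustrative names, not from the answer) shows the mechanism:

```r
# Hypothetical three-row data frame
toy <- data.frame(id = c("a", "b", "c"), x = c(10, 20, 30))

# Row 2 is requested three times
idx <- c(2, 3, 2, 1, 2)
picked <- toy[idx, ]

rownames(picked)  # "2" "3" "2.1" "1" "2.2"
picked$id         # "b" "c" "b" "a" "b"
```

Note that sampling indices with replace = TRUE draws from all rows at once, so a given row is not guaranteed to appear exactly ten times; table(samples) would show the actual counts.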

Upvotes: 1

Chase

Reputation: 69171

It's unclear to me what you're really trying to do, but the following changes to your gen.data function seem to do what you want. In particular, it's not clear what you intend with val: it just generates a random number whose mean is the value column for that row and whose standard deviation is that value divided by two. Is that what you want? I also added a new parameter, nreps, for the number of rows you want to generate per input row:

gen.data <- function(x, nreps = 10){
    inst <- x[1]
    constr <- x[2]
    v <- as.integer(x[3])
    val <- as.integer(rnorm(nreps, max(0, v), v / 2))

    out <- data.frame(inst = rep(inst, nreps),
                      constr = rep(constr, nreps),
                      val = val)

    return(out)
}

And then in use:

do.call("rbind", apply(m.starts, 1, gen.data))

Results in:

             inst         constr  val
1   competition01  event.clashes  876
2   competition01  event.clashes  714
3   competition01  event.clashes  912
4   competition01  event.clashes  -46
5   competition01  event.clashes  369
....
....
357 competition06   single.event  149
358 competition06   single.event  248
359 competition06   single.event  128
360 competition06   single.event  168
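One thing to be aware of is that apply converts the data frame to a character matrix before calling the function, which is why gen.data has to call as.integer on x[3]. As a hedged alternative sketch, the same per-row generation can be written with lapply over row indices, which keeps the column types intact. The data frame long and the helper gen_rows below are hypothetical stand-ins, not part of the original answer:

```r
set.seed(1)  # only so the sketch is reproducible

# Hypothetical long-format data, similar in shape to melt(starts)
long <- data.frame(
  instance = c("competition01", "competition02"),
  variable = c("event.clashes", "event.clashes"),
  value    = c(674L, 626L),
  stringsAsFactors = FALSE
)

# Illustrative helper: build nreps rows for input row i
gen_rows <- function(i, nreps = 10) {
  v <- long$value[i]
  data.frame(
    inst   = long$instance[i],   # scalars are recycled to nreps rows
    constr = long$variable[i],
    val    = as.integer(rnorm(nreps, mean = max(0, v), sd = v / 2))
  )
}

# One data frame per input row, stacked with rbind
result <- do.call(rbind, lapply(seq_len(nrow(long)), gen_rows))
dim(result)  # 20 rows, 3 columns
```

As in the output above, negative draws are possible because the normal distribution is unbounded; this sketch assumes the values are positive, as they are in the sample data.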

Upvotes: 9
