TheChainsOfMarkov
TheChainsOfMarkov

Reputation: 323

Repeating blocks of rows in a data frame based on another value in the data frame

There are a number of questions here about repeating rows a prespecified number of times in R, but I can't find one to address the specific question I'm asking.

I have a dataframe of responses from a survey in which each respondent answers somewhere between 5 and 10 questions. As a toy example:

df <- data.frame(ID = rep(1:2, each = 5),
             Response = sample(LETTERS[1:4], 10, replace = TRUE),
             Weight = rep(c(2,3), each = 5))

> df
   ID Response Weight
1   1        D      2
2   1        C      2
3   1        D      2
4   1        D      2
5   1        B      2
6   2        D      3
7   2        C      3
8   2        B      3
9   2        D      3
10  2        B      3

I would like to repeat respondent 1's answers twice, as a block, and then respondent 2's answers 3 times, as a block, and I want each block of responses to have a unique ID. In other words, I want the end result to look like this:

     ID Response Weight
1    11        D      2
2    11        C      2
3    11        D      2
4    11        D      2
5    11        B      2
6    12        D      2
7    12        C      2
8    12        D      2
9    12        D      2
10   12        B      2
11   21        D      3
12   21        C      3
13   21        B      3
14   21        D      3
15   21        B      3
16   22        D      3
17   22        C      3
18   22        B      3
19   22        D      3
20   22        B      3
21   23        D      3 
22   23        C      3
23   23        B      3
24   23        D      3
25   23        B      3

The way I'm doing this is currently really clunky, and, given that I have >3000 respondents in my dataset, is unbearably slow.

Here's my code:

df.expanded <- NULL
for(i in unique(df$ID)) {
  x <- df[df$ID == i,]
  y <- x[rep(seq_len(nrow(x)), x$Weight),1:3]
  y$order <- rep(1:max(x$Weight), nrow(x))
  y <- y[with(y, order(order)),]
  y$IDNew <- rep(max(y$ID)*100 + 1:max(x$Weight), each = nrow(x))
  df.expanded <- rbind(df.expanded, y)
}

Is there a faster way to do this?

Upvotes: 1

Views: 1366

Answers (2)

A5C1D2H2I1M1N2O1R2T1
A5C1D2H2I1M1N2O1R2T1

Reputation: 193687

Another approach would be to use data.table.

Assuming you're starting with "DT" as your data.table, try:

library(data.table)
DT[, list(.id = rep(seq(Weight[1]), each = .N), Weight, Response), .(ID)]

I haven't pasted the ID columns together, but instead, created a secondary column. That seems a little bit more flexible to me.


Data for testing. Change n to create a larger dataset to play with.

set.seed(1)
n <- 5
weights <- sample(3:15, n, TRUE)
df <- data.frame(ID = rep(seq_along(weights), weights),
                 Response = sample(LETTERS[1:5], sum(weights), TRUE),
                 Weight = rep(weights, weights))
DT <- as.data.table(df)

Upvotes: 1

mt1022
mt1022

Reputation: 17299

There is an easier solution. I suppose you want to duplicate rows based on Weight as shown in your code.

df2 <- df[rep(seq_along(df$Weight), df$Weight), ]
df2$ID <- paste(df2$ID, unlist(lapply(df$Weight, seq_len)), sep = '')

# sort the rows
df2 <- df2[order(df2$ID), ]

Is this method faster? Let's see:

library(microbenchmark)

microbenchmark(
    m1 = {
        df.expanded <- NULL
        for(i in unique(df$ID)) {
            x <- df[df$ID == i,]
            y <- x[rep(seq_len(nrow(x)), x$Weight),1:3]
            y$order <- rep(1:max(x$Weight), nrow(x))
            y <- y[with(y, order(order)),]
            y$IDNew <- rep(max(y$ID)*100 + 1:max(x$Weight), each = nrow(x))
            df.expanded <- rbind(df.expanded, y)
        }
    },
    m2 = {
        df2 <- df[rep(seq_along(df$Weight), df$Weight), ]
        df2$ID <- paste(df2$ID, unlist(lapply(df$Weight, seq_len)), sep = '')

        # sort the rows
        df2 <- df2[order(df2$ID), ]
    }
)

# Unit: microseconds
# expr     min      lq      mean   median       uq      max neval
# m1 806.295 862.460 1101.6672 921.0690 1283.387 2588.730   100
# m2 171.731 194.199  245.7246 214.3725  283.145  506.184   100

There might be other more efficient ways.

Upvotes: 1

Related Questions