TheChainsOfMarkov

Reputation: 323

Select rows in a data.frame when some rows repeat

I have the following toy dataset

set.seed(100)
df <- data.frame(ID = rep(1:5, each = 3),
                 value = sample(LETTERS, 15, replace = TRUE),
                 weight = rep(c(0.1, 0.1, 0.5, 0.2, 0.1), each = 3))
df

   ID value weight
1   1     I    0.1
2   1     G    0.1
3   1     O    0.1
4   2     B    0.1
5   2     M    0.1
6   2     M    0.1
7   3     V    0.5
8   3     J    0.5
9   3     O    0.5
10  4     E    0.2
11  4     Q    0.2
12  4     W    0.2
13  5     H    0.1
14  5     K    0.1
15  5     T    0.1

where each ID is an individual respondent answering 3 questions (in the actual dataset, the number of questions answered varies, so I can't rely on a fixed number of rows per ID).

I want to create a new (larger) dataset which samples from the individual IDs based on the weights in weight.

probs <- data.frame(ID = unique(df$ID))
probs$prob <- NA
for(i in 1:nrow(probs)){
  probs$prob[i] <- df[df$ID %in% probs$ID[i],]$weight[1]
}
probs$prob <- probs$prob / sum(probs$prob)
sampledIDs <- sample(probs$ID, size = 10000, replace = TRUE, prob = probs$prob)
head(sampledIDs, 10)

[1] 4 3 3 3 4 4 2 4 2 3
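(As an aside, the `probs` loop above can be vectorized. A sketch using base R `duplicated()`, relying on the weight being constant within each ID:)

```r
set.seed(100)
df <- data.frame(ID = rep(1:5, each = 3),
                 value = sample(LETTERS, 15, replace = TRUE),
                 weight = rep(c(0.1, 0.1, 0.5, 0.2, 0.1), each = 3))

# Keep the first row of each ID; that row carries the ID's weight.
probs <- df[!duplicated(df$ID), c("ID", "weight")]
probs$prob <- probs$weight / sum(probs$weight)

sampledIDs <- sample(probs$ID, size = 10000, replace = TRUE, prob = probs$prob)
```

This produces the same `probs` table as the loop, without iterating row by row.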

Moving from the probabilistic sampling of IDs to the actual creation of the new data.frame is stumping me. I've tried

dfW <- df[df$ID %in% sampledIDs,]

but that obviously doesn't take into account the fact that IDs repeat. I've also tried a loop:

dfW <- df[df$ID == sampledIDs[1],]
for(i in 2:length(sampledIDs)){
  dfW <- rbind(dfW, df[df$ID == sampledIDs[i],])
}

but that's painfully slow with a large dataset.

Any help would be very appreciated.

(Also, if there are simpler ways of doing the probabilistic selection of IDs, that would be great to hear too!)

Upvotes: 0

Views: 250

Answers (2)

Reza Dodge

Reputation: 190

If you don't know the final size in advance, you can resize dfW whenever needed, by adding an if condition inside the for loop. First define a function that doubles the number of rows of a data frame:

double_rowsize <- function(df) {
  mdf <- as.data.frame(matrix(NA, nrow = nrow(df), ncol = ncol(df)))
  colnames(mdf) <- colnames(df)
  df <- rbind(df, mdf)
  return(df)
}

Then initialize dfW with some starting size, for example 12 rows (3 rows times 4 IDs):

dfW <- as.data.frame(matrix(nrow = 12, ncol = 3))
colnames(dfW) <- colnames(df)

And finally add an if condition in the for loop to resize the dataframe whenever needed:

for(i in 1:length(sampledIDs)){ 
  if (3*i > nrow(dfW))
    dfW <- double_rowsize(dfW)
  dfW[(3*i-2):(3*i),] <- df[df$ID == sampledIDs[i],]
}

You can change the details of double_rowsize to grow the data frame by a factor other than 2 if that works better for your case. A factor of 2 is the common choice because it performs well for array resizing.
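(A quick sanity check of the doubling helper, restated here with an explicit NA fill so the snippet is self-contained: each call doubles the row count.)

```r
double_rowsize <- function(df) {
  # Build an all-NA data frame of the same shape and stack it underneath.
  mdf <- as.data.frame(matrix(NA, nrow = nrow(df), ncol = ncol(df)))
  colnames(mdf) <- colnames(df)
  rbind(df, mdf)
}

d <- as.data.frame(matrix(NA, nrow = 12, ncol = 3))
nrow(double_rowsize(d))                  # 24
nrow(double_rowsize(double_rowsize(d)))  # 48
```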

Good luck!

Upvotes: 0

Reza Dodge

Reputation: 190

The code is slow because you resize the data frame on every iteration of the for loop. Here is my suggestion: before the loop, create a data frame with the final size that dfW will have, then assign values from df into dfW inside the loop. You can replace the last part of your code with this:

dfW <- as.data.frame(matrix(nrow = 3 * length(sampledIDs), ncol = 3))
colnames(dfW) <- colnames(df)  # make the column names the same
for(i in 1:length(sampledIDs)){ # notice the start index is changed from 2 to 1
    #dfW <- rbind(dfW, df[df$ID == sampledIDs[i],])
    dfW[(3*i-2):(3*i),] <- df[df$ID == sampledIDs[i],]
}

Your code should run much faster with this change. Let me know how it goes!
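(A sketch of a loop-free alternative using base R split() and a single indexing step. Unlike the 3*i indexing above, this also handles a variable number of rows per ID, which the question mentions the real data has.)

```r
set.seed(100)
df <- data.frame(ID = rep(1:5, each = 3),
                 value = sample(LETTERS, 15, replace = TRUE),
                 weight = rep(c(0.1, 0.1, 0.5, 0.2, 0.1), each = 3))
sampledIDs <- c(4, 3, 3, 3, 4, 4, 2, 4, 2, 3)  # e.g. head(sampledIDs, 10)

# Group the row indices of df by ID, then look up each sampled ID once.
rows_by_id <- split(seq_len(nrow(df)), df$ID)
dfW <- df[unlist(rows_by_id[as.character(sampledIDs)]), ]
nrow(dfW)  # 30: 3 rows for each of the 10 sampled IDs
```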

Upvotes: 1
