xq1515426

Reputation: 89

How to randomly sample dataframe rows with unique column values

The ultimate objective is to compare the variance and standard deviation of a simple statistic (numerator / denominator / true_count) against the avg_score, across 10 trials of incrementally sized random samples per word, from a dataset similar to:

library(data.table)
set.seed(1)
df <- data.frame(
  word_ID = c(rep(1,4),rep(2,3),rep(3,2),rep(4,5),rep(5,5),rep(6,3),rep(7,4),rep(8,4),rep(9,6),rep(10,4)),
  word = c(rep("cat",4), rep("house", 3), rep("sung",2), rep("door",5), rep("pretty", 5), rep("towel",3), rep("car",4), rep("island",4), rep("ran",6), rep("pizza", 4)), 
  true_count = c(rep(234,4),rep(39,3),rep(876,2),rep(4,5),rep(67,5),rep(81,3),rep(90,4),rep(43,4),rep(54,6),rep(53,4)),
  occurrences = c(rep(234,4),rep(34,3),rep(876,2),rep(4,5),rep(65,5),rep(81,3),rep(90,4),rep(43,4),rep(54,6),rep(51,4)),
  item_score = runif(40),
  avg_score = rnorm(40),
  line = c(71,234,71,34,25,32,573,3,673,899,904,2,4,55,55,1003,100,432,100,29,87,326,413,32,54,523,87,988,988,12,24,754,987,12,4276,987,93,65,45,49),
  validity = sample(c("T", "F"), 40, replace = TRUE)

)
dt <- data.table(df)
dt[, denominator := 1:.N, by = word_ID]                # running count of detections per word
dt[, numerator := 1:.N, by = c("word_ID", "validity")] # running count per word and validity
dt[validity == "F", numerator := 0]                    # zero the numerator for invalid detections
df <- dt

df
    word_ID  word  true_count occurrences item_score   avg_score line validity denominator numerator
 1:       1    cat        234         234 0.25497614  0.15268651   71        F           1         0
 2:       1    cat        234         234 0.18662407  1.77376261  234        F           2         0
 3:       1    cat        234         234 0.74554352 -0.64807093   71        T           3         1
 4:       1    cat        234         234 0.93296878 -0.19981748   34        T           4         2
 5:       2  house         39          34 0.49471189  0.68924373   25        F           1         0
 6:       2  house         39          34 0.64499368  0.03614551   32        T           2         1
 7:       2  house         39          34 0.17580259  1.94353631  573        F           3         0
 8:       3   sung        876         876 0.60299465  0.73721373    3        T           1         1
 9:       3   sung        876         876 0.88775767  2.32133393  673        F           2         0
10:       4   door          4           4 0.49020940  0.34890935  899        T           1         1
11:       4   door          4           4 0.01838357 -1.13391666  904        T           2         2

The data represents each detection of a word in a document, so it's possible for a word to appear on the same line more than once. The task is for the sample size to count unique values of the line column, but to return all rows sharing a sampled line number, meaning the actual number of rows returned could exceed the specified sample size. So, for one trial with a sample size of two for "cat", the desired result would take the form:

    word_ID  word  true_count occurrences item_score   avg_score line validity denominator numerator
 1:       1    cat        234         234 0.25497614  0.15268651   71        F           1         0
 2:       1    cat        234         234 0.18662407  1.77376261  234        F           2         0
 3:       1    cat        234         234 0.74554352 -0.64807093   71        T           3         1

My basic iteration (found on this site) currently looks like:

a2 <- vector("list", 10)
b3 <- vector("list", 10)

for (i in 1:10) {
  a2[[i]] <- lapply(split(df, df$word_ID),
                    function(x) x[sample(nrow(x), 2, replace = TRUE), ])
  b3[[i]] <- lapply(split(df, df$word_ID),
                    function(x) x[sample(nrow(x), 3, replace = TRUE), ])
}

So I can take standard fixed-size random samples, but I'm unsure how to approach the goal stated above (I couldn't find anything similar, or perhaps wasn't searching the right way). Is there a straightforward way to do this?

Thanks,

Upvotes: 4

Views: 1459

Answers (2)

lmo

Reputation: 38500

Here is a data.table solution that uses a join on a sampled data.table.

set.seed(1234)
df[df[, .(line=sample(unique(line), 2)), by=word], on=.(word, line)]

The inner data.table consists of two columns, word and line, and has two rows per word, each with a unique value for line. The values for line are returned by sample which is fed the unique values of line and is performed separately for each word (using by=word). You can vary the number of unique line values by changing 2 to your desired value. This data.table is joined onto the main data.table in order to select the desired rows.

In this instance, you get

    word_ID   word true_count occurrences item_score   avg_score line validity
 1:       1    cat        234         234 0.26550866  0.91897737   71        F
 2:       1    cat        234         234 0.57285336  0.07456498   71        T
 3:       1    cat        234         234 0.37212390  0.78213630  234        T
 4:       2  house         39          34 0.89838968 -0.05612874   32        T
 5:       2  house         39          34 0.94467527 -0.15579551  573        F
 6:       3   sung        876         876 0.62911404 -0.47815006  673        T
 7:       3   sung        876         876 0.66079779 -1.47075238    3        T
 8:       4   door          4           4 0.06178627  0.41794156  899        F
 9:       4   door          4           4 0.38410372 -0.05380504   55        F
10:       5 pretty         67          65 0.71761851 -0.39428995  100        F
11:       5 pretty         67          65 0.38003518  1.10002537  100        F
12:       5 pretty         67          65 0.49769924 -0.41499456 1003        F
13:       6  towel         81          81 0.21214252 -0.25336168  326        F
14:       6  towel         81          81 0.93470523 -0.16452360   87        F
15:       7    car         90          90 0.12555510  0.55666320   32        T
16:       7    car         90          90 0.26722067 -0.68875569   54        F
17:       8 island         43          43 0.01339033  0.36458196   87        T
18:       8 island         43          43 0.38238796  0.76853292  988        F
19:       8 island         43          43 0.86969085 -0.11234621  988        T
20:       9    ran         54          54 0.59956583 -0.61202639  754        F
21:       9    ran         54          54 0.82737332  1.43302370 4276        F
22:      10  pizza         53          51 0.79423986 -0.36722148   93        F
23:      10  pizza         53          51 0.41127443 -0.13505460   49        T
    word_ID   word true_count occurrences item_score   avg_score line validity
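Since the stated objective is to repeat each sample size for 10 trials, here is a minimal sketch of how this join might be wrapped in a loop, using the question's df (the names n, trials, and the final summary are illustrative assumptions, not part of the answer; min(n, uniqueN(line)) guards against words with fewer unique lines than the sample size):

n <- 2                        # unique line values to sample per word
trials <- vector("list", 10)  # one sampled data.table per trial
for (i in 1:10) {
  # sample n unique line values per word, then join back to keep
  # every row whose line number was sampled
  sampled_lines <- df[, .(line = sample(unique(line), min(n, uniqueN(line)))), by = word]
  trials[[i]] <- df[sampled_lines, on = .(word, line)]
}

# illustrative per-trial spread of the statistic from the question
sapply(trials, function(x) x[, sd(numerator / denominator / true_count)])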

Upvotes: 3

r2evans

Reputation: 160407

If you sample from a de-duplicated data.frame and then do a left join back to the original data, you can get exactly what you need.

I'm not proficient with data.table, so I'll use base functions. (dplyr would work well here, too, but since you're using data.table, I'll avoid it for now.) (As I'm about to hit submit, @lmo provided a dt-specific answer ...)

By "de-duplicate", I mean:

subdf <- df[,c("word_ID", "line")]
subdf <- subdf[!duplicated(subdf),]
dim(subdf)
# [1] 36  2
head(subdf)
#   word_ID line
# 1       1   71
# 2       1  234
# 4       1   34
# 5       2   25
# 6       2   32
# 7       2  573

Note that subdf has only three rows for word_ID 1, whereas the original data has four:

df[1:4,]
#   word_ID word true_count occurrences item_score   avg_score line validity
# 1       1  cat        234         234  0.2655087  0.91897737   71        F
# 2       1  cat        234         234  0.3721239  0.78213630  234        T
# 3       1  cat        234         234  0.5728534  0.07456498   71        T
# 4       1  cat        234         234  0.9082078 -1.98935170   34        T

I'm using by here instead of lapply/split, but the results should be the same:

out <- by(subdf, subdf$word_ID, function(x)
  merge(x[sample(nrow(x), 2, replace = TRUE), ], df, by = c("word_ID", "line")))
out[1]
# $`1`
#   word_ID line word true_count occurrences item_score   avg_score validity
# 1       1   34  cat        234         234  0.9082078 -1.98935170        T
# 2       1   71  cat        234         234  0.5728534  0.07456498        T
# 3       1   71  cat        234         234  0.2655087  0.91897737        F
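For completeness, a hedged sketch of how this base approach could slot into the question's 10-trial loop (the trials2 name is an illustrative assumption; note it samples without replacement so the chosen lines are distinct, unlike replace=TRUE above):

trials2 <- vector("list", 10)
for (i in 1:10) {
  # sample up to 2 de-duplicated (word_ID, line) pairs per word, then
  # merge back so every detection on a sampled line is returned
  sampled <- by(subdf, subdf$word_ID, function(x)
    merge(x[sample(nrow(x), min(2, nrow(x))), ], df, by = c("word_ID", "line")))
  trials2[[i]] <- do.call(rbind, sampled)  # collapse per-word results into one data.frame
}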

Upvotes: 2
