xq1515426

Reputation: 89

How to randomly sample dataframe rows with unique column values

The ultimate objective is to compare the variance and standard deviation of a simple statistic (numerator / denominator / true_count) against the avg_score, across 10 trials of incrementally sized random samples per word, from a dataset similar to:

library(data.table)
set.seed(1)
df <- data.frame(
  word_ID = c(rep(1,4),rep(2,3),rep(3,2),rep(4,5),rep(5,5),rep(6,3),rep(7,4),rep(8,4),rep(9,6),rep(10,4)),
  word = c(rep("cat",4), rep("house", 3), rep("sung",2), rep("door",5), rep("pretty", 5), rep("towel",3), rep("car",4), rep("island",4), rep("ran",6), rep("pizza", 4)), 
  true_count = c(rep(234,4),rep(39,3),rep(876,2),rep(4,5),rep(67,5),rep(81,3),rep(90,4),rep(43,4),rep(54,6),rep(53,4)),
  occurrences = c(rep(234,4),rep(34,3),rep(876,2),rep(4,5),rep(65,5),rep(81,3),rep(90,4),rep(43,4),rep(54,6),rep(51,4)),
  item_score = runif(40),
  avg_score = rnorm(40),
  line = c(71,234,71,34,25,32,573,3,673,899,904,2,4,55,55,1003,100,432,100,29,87,326,413,32,54,523,87,988,988,12,24,754,987,12,4276,987,93,65,45,49),
  validity = sample(c("T", "F"), 40, replace = TRUE)

)
dt <- data.table(df)
dt[, denominator := 1:.N, by = word_ID]                # running count of detections per word
dt[, numerator := 1:.N, by = c("word_ID", "validity")] # running count per word and validity
dt[validity == "F", numerator := 0]                    # zero the numerator for invalid detections
df <- dt

df
    word_ID  word  true_count occurrences item_score   avg_score line validity denominator numerator
 1:       1    cat        234         234 0.25497614  0.15268651   71        F           1         0
 2:       1    cat        234         234 0.18662407  1.77376261  234        F           2         0
 3:       1    cat        234         234 0.74554352 -0.64807093   71        T           3         1
 4:       1    cat        234         234 0.93296878 -0.19981748   34        T           4         2
 5:       2  house         39          34 0.49471189  0.68924373   25        F           1         0
 6:       2  house         39          34 0.64499368  0.03614551   32        T           2         1
 7:       2  house         39          34 0.17580259  1.94353631  573        F           3         0
 8:       3   sung        876         876 0.60299465  0.73721373    3        T           1         1
 9:       3   sung        876         876 0.88775767  2.32133393  673        F           2         0
10:       4   door          4           4 0.49020940  0.34890935  899        T           1         1
11:       4   door          4           4 0.01838357 -1.13391666  904        T           2         2

The data represents each detection of a word in a document, so it's possible for a word to appear on the same line more than once. The task is for the sample size to count unique values of the line column, but to return all rows sharing a sampled line number, meaning the actual number of rows returned could exceed the specified sample size. So, for one trial with a sample size of two for "cat", the desired result would take the form:

    word_ID  word  true_count occurrences item_score   avg_score line validity denominator numerator
 1:       1    cat        234         234 0.25497614  0.15268651   71        F           1         0
 2:       1    cat        234         234 0.18662407  1.77376261  234        F           2         0
 3:       1    cat        234         234 0.74554352 -0.64807093   71        T           3         1

My basic iteration (found on this site) currently looks like:

a2 <- vector("list", 10)
b3 <- vector("list", 10)

for (i in 1:10) {
  a2[[i]] <- lapply(split(df, df$word_ID),
                    function(x) x[sample(nrow(x), 2, replace = TRUE), ])
  b3[[i]] <- lapply(split(df, df$word_ID),
                    function(x) x[sample(nrow(x), 3, replace = TRUE), ])
}

So I can take standard fixed-size random samples, but I'm unsure how to approach the goal stated above (I couldn't find anything similar, or perhaps wasn't searching the right way). Is there a straightforward way to do this?

Thanks,

Upvotes: 4

Views: 1459

Answers (2)

lmo

Reputation: 38500

Here is a data.table solution that uses a join on a sampled data.table.

set.seed(1234)
df[df[, .(line=sample(unique(line), 2)), by=word], on=.(word, line)]

The inner data.table consists of two columns, word and line, and has two rows per word, each with a unique value for line. The values for line are returned by sample which is fed the unique values of line and is performed separately for each word (using by=word). You can vary the number of unique line values by changing 2 to your desired value. This data.table is joined onto the main data.table in order to select the desired rows.

In this instance, you get

    word_ID   word true_count occurrences item_score   avg_score line validity
 1:       1    cat        234         234 0.26550866  0.91897737   71        F
 2:       1    cat        234         234 0.57285336  0.07456498   71        T
 3:       1    cat        234         234 0.37212390  0.78213630  234        T
 4:       2  house         39          34 0.89838968 -0.05612874   32        T
 5:       2  house         39          34 0.94467527 -0.15579551  573        F
 6:       3   sung        876         876 0.62911404 -0.47815006  673        T
 7:       3   sung        876         876 0.66079779 -1.47075238    3        T
 8:       4   door          4           4 0.06178627  0.41794156  899        F
 9:       4   door          4           4 0.38410372 -0.05380504   55        F
10:       5 pretty         67          65 0.71761851 -0.39428995  100        F
11:       5 pretty         67          65 0.38003518  1.10002537  100        F
12:       5 pretty         67          65 0.49769924 -0.41499456 1003        F
13:       6  towel         81          81 0.21214252 -0.25336168  326        F
14:       6  towel         81          81 0.93470523 -0.16452360   87        F
15:       7    car         90          90 0.12555510  0.55666320   32        T
16:       7    car         90          90 0.26722067 -0.68875569   54        F
17:       8 island         43          43 0.01339033  0.36458196   87        T
18:       8 island         43          43 0.38238796  0.76853292  988        F
19:       8 island         43          43 0.86969085 -0.11234621  988        T
20:       9    ran         54          54 0.59956583 -0.61202639  754        F
21:       9    ran         54          54 0.82737332  1.43302370 4276        F
22:      10  pizza         53          51 0.79423986 -0.36722148   93        F
23:      10  pizza         53          51 0.41127443 -0.13505460   49        T
    word_ID   word true_count occurrences item_score   avg_score line validity
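Since the stated objective is to repeat each sample size for 10 trials, here is a minimal sketch of how this join might be wrapped in a loop, using the question's df (the names n, trials, and the final summary are illustrative assumptions, not part of the answer; min(n, uniqueN(line)) guards against words with fewer unique lines than the sample size):

n <- 2                        # unique line values to sample per word
trials <- vector("list", 10)  # one sampled data.table per trial
for (i in 1:10) {
  # sample n unique line values per word, then join back to keep
  # every row whose line number was sampled
  sampled_lines <- df[, .(line = sample(unique(line), min(n, uniqueN(line)))), by = word]
  trials[[i]] <- df[sampled_lines, on = .(word, line)]
}

# illustrative per-trial spread of the statistic from the question
sapply(trials, function(x) x[, sd(numerator / denominator / true_count)])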

Upvotes: 3

r2evans

Reputation: 160407

If you sample from a de-duplicated data.frame and then do a left join back to the original data, you can get exactly what you need.

I'm not proficient with data.table, so I'll use base functions. (dplyr would work well here, too, but since you're using data.table, I'll avoid it for now.) (As I'm about to hit submit, @lmo provided a dt-specific answer ...)

By "de-duplicate", I mean:

subdf <- df[,c("word_ID", "line")]
subdf <- subdf[!duplicated(subdf),]
dim(subdf)
# [1] 36  2
head(subdf)
#   word_ID line
# 1       1   71
# 2       1  234
# 4       1   34
# 5       2   25
# 6       2   32
# 7       2  573

Note that subdf has only three rows for word_ID 1, whereas the original data has four:

df[1:4,]
#   word_ID word true_count occurrences item_score   avg_score line validity
# 1       1  cat        234         234  0.2655087  0.91897737   71        F
# 2       1  cat        234         234  0.3721239  0.78213630  234        T
# 3       1  cat        234         234  0.5728534  0.07456498   71        T
# 4       1  cat        234         234  0.9082078 -1.98935170   34        T

I'm using by here instead of lapply/split, but the results should be the same:

out <- by(subdf, subdf$word_ID, function(x)
  merge(x[sample(nrow(x), 2, replace = TRUE), ], df, by = c("word_ID", "line")))
out[1]
# $`1`
#   word_ID line word true_count occurrences item_score   avg_score validity
# 1       1   34  cat        234         234  0.9082078 -1.98935170        T
# 2       1   71  cat        234         234  0.5728534  0.07456498        T
# 3       1   71  cat        234         234  0.2655087  0.91897737        F
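For completeness, a hedged sketch of how this base approach could slot into the question's 10-trial loop (the trials2 name is an illustrative assumption; note it samples without replacement so the chosen lines are distinct, unlike replace=TRUE above):

trials2 <- vector("list", 10)
for (i in 1:10) {
  # sample up to 2 de-duplicated (word_ID, line) pairs per word, then
  # merge back so every detection on a sampled line is returned
  sampled <- by(subdf, subdf$word_ID, function(x)
    merge(x[sample(nrow(x), min(2, nrow(x))), ], df, by = c("word_ID", "line")))
  trials2[[i]] <- do.call(rbind, sampled)  # collapse per-word results into one data.frame
}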

Upvotes: 2
