Reputation: 89
The ultimate objective is to compare the variance and standard deviation of a simple statistic (numerator / denominator / true_count) from the avg_score
for 10 trials of incrementally sized random samples per word from a dataset similar to:
library (data.table)
set.seed(1)
df <- data.frame(
word_ID = c(rep(1,4),rep(2,3),rep(3,2),rep(4,5),rep(5,5),rep(6,3),rep(7,4),rep(8,4),rep(9,6),rep(10,4)),
word = c(rep("cat",4), rep("house", 3), rep("sung",2), rep("door",5), rep("pretty", 5), rep("towel",3), rep("car",4), rep("island",4), rep("ran",6), rep("pizza", 4)),
true_count = c(rep(234,4),rep(39,3),rep(876,2),rep(4,5),rep(67,5),rep(81,3),rep(90,4),rep(43,4),rep(54,6),rep(53,4)),
occurrences = c(rep(234,4),rep(34,3),rep(876,2),rep(4,5),rep(65,5),rep(81,3),rep(90,4),rep(43,4),rep(54,6),rep(51,4)),
item_score = runif(40),
avg_score = rnorm(40),
line = c(71,234,71,34,25,32,573,3,673,899,904,2,4,55,55,1003,100,432,100,29,87,326,413,32,54,523,87,988,988,12,24,754,987,12,4276,987,93,65,45,49),
validity = sample(c("T", "F"), 40, replace = T)
)
dt <- data.table(df)
dt[ , denominator := 1:.N, by=word_ID]
dt[ , numerator := 1:.N, by=c("word_ID", "validity")]
dt$numerator[df$validity=="F"] <- 0
df <- dt
<df
word_ID word true_count occurrences item_score avg_score line validity denominator numerator
1: 1 cat 234 234 0.25497614 0.15268651 71 F 1 0
2: 1 cat 234 234 0.18662407 1.77376261 234 F 2 0
3: 1 cat 234 234 0.74554352 -0.64807093 71 T 3 1
4: 1 cat 234 234 0.93296878 -0.19981748 34 T 4 2
5: 2 house 39 34 0.49471189 0.68924373 25 F 1 0
6: 2 house 39 34 0.64499368 0.03614551 32 T 2 1
7: 2 house 39 34 0.17580259 1.94353631 573 F 3 0
8: 3 sung 876 876 0.60299465 0.73721373 3 T 1 1
9: 3 sung 876 876 0.88775767 2.32133393 673 F 2 0
10: 4 door 4 4 0.49020940 0.34890935 899 T 1 1
11: 4 door 4 4 0.01838357 -1.13391666 904 T 2 2
The data represents each detection of a word in a document, so it's possible for a word to appear on the same line more than once. The task is for the sample size to represent unique column values (line
), but to return all instances where the line number is the same- meaning the actual number of rows returned could be more than the specified sample size. So, for one two-word sample size trial for "cat", the form of the desired result would be:
word_ID word true_count occurrences item_score avg_score line validity denominator numerator
1: 1 cat 234 234 0.25497614 0.15268651 71 F 1 0
2: 1 cat 234 234 0.18662407 1.77376261 234 F 2 0
3: 1 cat 234 234 0.74554352 -0.64807093 71 T 3 1
My basic iteration (found on this site) currently looks like:
for (i in 1:10) {
a2[[i]] <- lapply(split(df, df$word_ID), function(x) x[sample(nrow(x), 2, replace = T), ])
b3[[i]] <- lapply(split(df, df$word_ID), function(x) x[sample(nrow(x), 3, replace = T), ])}
}
So, I can do the standard random sample sizes, but am unsure (and couldn't find something similar or wasn't looking the right way) how to approach the goal stated above. Is there a straight-forward way to approach this?
Thanks,
Upvotes: 4
Views: 1459
Reputation: 38500
Here is a data.table solution that uses a join on a sampled data.table.
set.seed(1234)
df[df[, .(line=sample(unique(line), 2)), by=word], on=.(word, line)]
The inner data.table consists of two columns, word and line, and has two rows per word, each with a unique value for line. The values for line are returned by sample
which is fed the unique values of line and is performed separately for each word (using by=word
). You can vary the number of unique line values by changing 2 to your desired value. This data.table is joined onto the main data.table in order to select the desired rows.
In this instance, you get
word_ID word true_count occurrences item_score avg_score line validity
1: 1 cat 234 234 0.26550866 0.91897737 71 F
2: 1 cat 234 234 0.57285336 0.07456498 71 T
3: 1 cat 234 234 0.37212390 0.78213630 234 T
4: 2 house 39 34 0.89838968 -0.05612874 32 T
5: 2 house 39 34 0.94467527 -0.15579551 573 F
6: 3 sung 876 876 0.62911404 -0.47815006 673 T
7: 3 sung 876 876 0.66079779 -1.47075238 3 T
8: 4 door 4 4 0.06178627 0.41794156 899 F
9: 4 door 4 4 0.38410372 -0.05380504 55 F
10: 5 pretty 67 65 0.71761851 -0.39428995 100 F
11: 5 pretty 67 65 0.38003518 1.10002537 100 F
12: 5 pretty 67 65 0.49769924 -0.41499456 1003 F
13: 6 towel 81 81 0.21214252 -0.25336168 326 F
14: 6 towel 81 81 0.93470523 -0.16452360 87 F
15: 7 car 90 90 0.12555510 0.55666320 32 T
16: 7 car 90 90 0.26722067 -0.68875569 54 F
17: 8 island 43 43 0.01339033 0.36458196 87 T
18: 8 island 43 43 0.38238796 0.76853292 988 F
19: 8 island 43 43 0.86969085 -0.11234621 988 T
20: 9 ran 54 54 0.59956583 -0.61202639 754 F
21: 9 ran 54 54 0.82737332 1.43302370 4276 F
22: 10 pizza 53 51 0.79423986 -0.36722148 93 F
23: 10 pizza 53 51 0.41127443 -0.13505460 49 T
word_ID word true_count occurrences item_score avg_score line validity
Upvotes: 3
Reputation: 160407
If you sample from a de-duplicated data.frame and do a subsequent left-join with the original data, you can ensure what you need.
I'm not proficient with data.table
, so I'll use base functions. (dplyr
would work well here, too, but since you're using data.table
, I'll avoid it for now.) (As I'm about to hit submit, @lmo provided a dt
-specific answer ...)
By "de-duplicate", I mean:
subdf <- df[,c("word_ID", "line")]
subdf <- subdf[!duplicated(subdf),]
dim(subdf)
# [1] 36 2
head(subdf)
# word_ID line
# 1 1 71
# 2 1 234
# 4 1 34
# 5 2 25
# 6 2 32
# 7 2 573
Note that the subdf
only has three rows for 1, whereas the original data has 4:
df[1:4,]
# word_ID word true_count occurrences item_score avg_score line validity
# 1 1 cat 234 234 0.2655087 0.91897737 71 F
# 2 1 cat 234 234 0.3721239 0.78213630 234 T
# 3 1 cat 234 234 0.5728534 0.07456498 71 T
# 4 1 cat 234 234 0.9082078 -1.98935170 34 T
I'm using by
here instead of lapply/split
, but the results should be the same:
out <- by(subdf, subdf$word_ID, function(x) merge(x[sample(nrow(x), 2, replace=TRUE),], df, by=c("word_ID", "line")))
out[1]
# $`1`
# word_ID line word true_count occurrences item_score avg_score validity
# 1 1 34 cat 234 234 0.9082078 -1.98935170 T
# 2 1 71 cat 234 234 0.5728534 0.07456498 T
# 3 1 71 cat 234 234 0.2655087 0.91897737 F
Upvotes: 2