Reputation: 59
I need to to create a sample from my dataframe and to do so I am using the code bellow.
name <- sample(c("Adam","John","Henry","Mike"),100,rep = TRUE)
area <- sample(c("run","develop","test"),100,rep = TRUE)
id <- sample(100:200,100,rep = FALSE)
mydata <- as.data.frame(cbind(id,area,name))
qcsample <- mydata %>%
group_by(area) %>%
nest() %>%
mutate(n = c(20, 15, 15)) %>%
mutate(samp = map2(data, n, sample_n)) %>%
select(area, samp) %>%
unnest()
Now, I am getting these results.
table(qcsample$area)
develop run test
15 15 20
--
table(qcsample$name)
Adam Henry John Mike
9 9 16 16
I would like to create a sample that would have more or less the same number of samples for each name eg. Adam - 12, Henry - 12, John - 13, Mike - 13. How can I achieve that ? can I somehow request that the sample is equally distributed ?
Also, in this example I used function
sample_n
and specified number of samples.
I am anticipating that sometimes there will not be required number from a given group. In my example I am taking 20 samples from area called "test" but sometimes there will be only let's say 10 rows containing "test". The total number is 50 so I need to make sure if there are only 10 "test" the code has to automatically increase the others, so the sample would be "test" - 10, "run" - 20 and "develop" - 20. This can happen to any of the area so I need to test if there is enough rows to create the sample and increase other areas. If there is only 1 it can be added to any of the remaining areas or if the difference is 3 we add 1 to one area and 2 to the another one.
How could I check that taking into account all the possibilities ? I believe there are eight permutations in this case.
Thanks in advance A.
Upvotes: 2
Views: 297
Reputation: 160447
Here's another thought.
Depending on your desired end-size, it might over-create the number of samples so that it can reduce some name/area pairs to bring the total down.
Let's say that you want to end up with a total of 50 rows:
final_size <- 50
For completeness, here are the sets from which we'll choose:
avail_names <- c("Adam", "John", "Henry", "Mike")
avail_areas <- c("run", "develop", "test")
and the minimum we need to create for Adam,run
(etc) in order to certainly end up with no less than final_size
rows:
size_per_namearea <- ceiling(final_size / (length(avail_names) * length(avail_areas)))
Ok, generate at least as many (likely more than) the number of rows we need:
set.seed(20180920)
qcsample <- crossing(data_frame(rownum = seq_len(size_per_namearea)),
data_frame(name = avail_names),
data_frame(area = avail_areas)) %>%
group_by(name, area) %>%
mutate(id = sample(100, size = n(), replace = FALSE))
qcsample
# # A tibble: 60 x 4
# # Groups: name, area [12]
# rownum name area id
# <int> <chr> <chr> <int>
# 1 1 Adam run 59
# 2 1 Adam develop 51
# 3 1 Adam test 23
# 4 1 John run 71
# 5 1 John develop 5
# 6 1 John test 24
# 7 1 Henry run 4
# 8 1 Henry develop 29
# 9 1 Henry test 79
# 10 1 Mike run 77
# # ... with 50 more rows
Verify we have identical sample sizes for each name/area:
xtabs(~ name + area, data = qcsample) %>%
stats::addmargins()
# area
# name develop run test Sum
# Adam 5 5 5 15
# Henry 5 5 5 15
# John 5 5 5 15
# Mike 5 5 5 15
# Sum 20 20 20 60
If we just do head(final_size)
, then we know which names we will be cutting short, which undermines the randomness of your sampling a little. The reason I added rownum
up front was so that I can arrange by it plus a jitter, ensuring I get all of max(rownum)-1
, and then some sampling of max(rownum)
, guaranteeing that each name/area pair have either max(rownum)-1
or max(rownum)
rows; your tallies are never different by more than 1.
reducedsample <- arrange(qcsample, rownum + runif(n())) %>%
head(final_size) %>%
select(-rownum)
reducedsample %>%
xtabs(~ name + area, data = .) %>%
stats::addmargins()
# area
# name develop run test Sum
# Adam 4 4 5 13
# Henry 5 4 4 13
# John 4 4 4 12
# Mike 4 4 4 12
# Sum 17 16 17 50
Upvotes: 1
Reputation: 2263
If you are using made up data then you can create a minimum amount of each row and then create filler to get you up to the total:
set.seed(42)
names <- c("Adam", "John", "Henry", "Mike")
areas <- c("run", "develop", "test")
totalrows <- 100
minname <- 22 # No less than 20 of each name (set to near threshold to test)
minarea <- 30 # No less than 30 of each area (less randomness the higher these are)
qcsample <- data.frame(
name=sample(c(rep(names, minname), sample(names, totalrows-length(names)*minname, replace=T))),
area=sample(c(rep(areas, minarea), sample(areas, totalrows-length(areas)*minarea, replace=T))),
id=sample(99+(1:totalrows))
)
This results in:
R> table(qcsample$name)
Adam Henry John Mike
23 28 24 25
R> table(qcsample$area)
develop run test
37 31 32
Notice that the count of name
to area
isn't constrained:
R> table(qcsample[,-3])
area
name develop run test
Adam 5 11 7
Henry 11 8 9
John 10 7 7
Mike 11 5 9
R>
Using a loop as suggested by @r2evans:
library(dplyr)
set.seed(42)
mydata <- data.frame(
name = sample(c("Adam","John","Henry","Mike"), 100, rep = TRUE),
area = sample(c("run","develop","test"), 100, rep = TRUE),
id = sample(100:200, 100, rep = FALSE)
)
Nsamples <- 50
mysample <- data.frame(sample_n(mydata, Nsamples))
minname <- 11 # max is 50/4 -> 12
minarea <- 15 # max is 50/3 -> 16
# the test you were asking about
while( (min(table(mysample$name)) < minname) || (min(table(mysample$area)) < minarea) ) {
mysample <- data.frame(sample_n(mydata, Nsamples))
}
This results in:
R> table(mysample$name)
Adam Henry John Mike
13 15 11 11
R> table(mysample$area)
develop run test
15 17 18
And, like before, there's no minimum of name to area.
R> table(mysample[-3])
area
name develop run test
Adam 4 3 6
Henry 2 6 7
John 4 4 3
Mike 5 4 2
If you needed to enforce an minimum number for each permutation add this to the test:
while(... || (min(table(mysample[-3])) < some_min)) {
BTW, the number of permutations, as you can see by the table, is the number of names times the number of areas.
Upvotes: 1