Art
Art

Reputation: 59

R sampling with if statement and similar number of sample

I need to to create a sample from my dataframe and to do so I am using the code bellow.

 name <- sample(c("Adam","John","Henry","Mike"),100,rep = TRUE)
 area <- sample(c("run","develop","test"),100,rep = TRUE)
 id <- sample(100:200,100,rep = FALSE)

 mydata <- as.data.frame(cbind(id,area,name))


qcsample <- mydata %>%
  group_by(area) %>% 
  nest() %>%            
  mutate(n = c(20, 15, 15)) %>% 
  mutate(samp = map2(data, n, sample_n)) %>% 
  select(area, samp) %>%
  unnest()

Now, I am getting these results.

table(qcsample$area) 

develop     run    test 
     15      15      20 

--

table(qcsample$name)

Adam Henry  John  Mike 

    9     9    16    16 

I would like to create a sample that would have more or less the same number of samples for each name eg. Adam - 12, Henry - 12, John - 13, Mike - 13. How can I achieve that ? can I somehow request that the sample is equally distributed ?

Also, in this example I used function

sample_n

and specified number of samples.

I am anticipating that sometimes there will not be required number from a given group. In my example I am taking 20 samples from area called "test" but sometimes there will be only let's say 10 rows containing "test". The total number is 50 so I need to make sure if there are only 10 "test" the code has to automatically increase the others, so the sample would be "test" - 10, "run" - 20 and "develop" - 20. This can happen to any of the area so I need to test if there is enough rows to create the sample and increase other areas. If there is only 1 it can be added to any of the remaining areas or if the difference is 3 we add 1 to one area and 2 to the another one.

How could I check that taking into account all the possibilities ? I believe there are eight permutations in this case.

Thanks in advance A.

Upvotes: 2

Views: 297

Answers (2)

r2evans
r2evans

Reputation: 160447

Here's another thought.

Depending on your desired end-size, it might over-create the number of samples so that it can reduce some name/area pairs to bring the total down.

Let's say that you want to end up with a total of 50 rows:

final_size <- 50

For completeness, here are the sets from which we'll choose:

avail_names <- c("Adam", "John", "Henry", "Mike")
avail_areas <- c("run", "develop", "test")

and the minimum we need to create for Adam,run (etc) in order to certainly end up with no less than final_size rows:

size_per_namearea <- ceiling(final_size / (length(avail_names) * length(avail_areas)))

Ok, generate at least as many (likely more than) the number of rows we need:

set.seed(20180920)
qcsample <- crossing(data_frame(rownum = seq_len(size_per_namearea)),
                     data_frame(name   = avail_names),
                     data_frame(area   = avail_areas)) %>%
  group_by(name, area) %>%
  mutate(id = sample(100, size = n(), replace = FALSE))
qcsample
# # A tibble: 60 x 4
# # Groups:   name, area [12]
#    rownum name  area       id
#     <int> <chr> <chr>   <int>
#  1      1 Adam  run        59
#  2      1 Adam  develop    51
#  3      1 Adam  test       23
#  4      1 John  run        71
#  5      1 John  develop     5
#  6      1 John  test       24
#  7      1 Henry run         4
#  8      1 Henry develop    29
#  9      1 Henry test       79
# 10      1 Mike  run        77
# # ... with 50 more rows

Verify we have identical sample sizes for each name/area:

xtabs(~ name + area, data = qcsample) %>%
  stats::addmargins()
#        area
# name    develop run test Sum
#   Adam        5   5    5  15
#   Henry       5   5    5  15
#   John        5   5    5  15
#   Mike        5   5    5  15
#   Sum        20  20   20  60

If we just do head(final_size), then we know which names we will be cutting short, which undermines the randomness of your sampling a little. The reason I added rownum up front was so that I can arrange by it plus a jitter, ensuring I get all of max(rownum)-1, and then some sampling of max(rownum), guaranteeing that each name/area pair have either max(rownum)-1 or max(rownum) rows; your tallies are never different by more than 1.

reducedsample <- arrange(qcsample, rownum + runif(n()))  %>%
  head(final_size) %>%
  select(-rownum)
reducedsample %>%
  xtabs(~ name + area, data = .) %>%
  stats::addmargins()
#        area
# name    develop run test Sum
#   Adam        4   4    5  13
#   Henry       5   4    4  13
#   John        4   4    4  12
#   Mike        4   4    4  12
#   Sum        17  16   17  50

Upvotes: 1

keithpjolley
keithpjolley

Reputation: 2263

If you are using made up data then you can create a minimum amount of each row and then create filler to get you up to the total:

set.seed(42)

names <- c("Adam", "John", "Henry", "Mike")
areas <- c("run", "develop", "test")

totalrows <- 100
minname   <-  22 # No less than 20 of each name (set to near threshold to test)
minarea   <-  30 # No less than 30 of each area (less randomness the higher these are)

qcsample <- data.frame(
  name=sample(c(rep(names, minname), sample(names, totalrows-length(names)*minname, replace=T))),
  area=sample(c(rep(areas, minarea), sample(areas, totalrows-length(areas)*minarea, replace=T))),
  id=sample(99+(1:totalrows))
)

This results in:

R> table(qcsample$name)

 Adam Henry  John  Mike 
   23    28    24    25 
R> table(qcsample$area)

develop     run    test 
     37      31      32

Notice that the count of name to area isn't constrained:

R> table(qcsample[,-3])
       area
name    develop run test
  Adam        5  11    7
  Henry      11   8    9
  John       10   7    7
  Mike       11   5    9
R> 

Using a loop as suggested by @r2evans:

library(dplyr)
set.seed(42)

mydata <- data.frame(
  name = sample(c("Adam","John","Henry","Mike"), 100, rep = TRUE),
  area = sample(c("run","develop","test"), 100, rep = TRUE),
  id   = sample(100:200, 100, rep = FALSE)
)

Nsamples <- 50
mysample <- data.frame(sample_n(mydata, Nsamples))

minname <- 11  # max is 50/4 -> 12 
minarea <- 15  # max is 50/3 -> 16

# the test you were asking about
while( (min(table(mysample$name)) < minname) || (min(table(mysample$area)) < minarea) ) {
  mysample <- data.frame(sample_n(mydata, Nsamples))
}

This results in:

R> table(mysample$name)

 Adam Henry  John  Mike 
   13    15    11    11 

R> table(mysample$area)

develop     run    test 
     15      17      18 

And, like before, there's no minimum of name to area.

R> table(mysample[-3])
       area
name    develop run test
  Adam        4   3    6
  Henry       2   6    7
  John        4   4    3
  Mike        5   4    2

If you needed to enforce an minimum number for each permutation add this to the test:

while(... || (min(table(mysample[-3])) < some_min)) {

BTW, the number of permutations, as you can see by the table, is the number of names times the number of areas.

Upvotes: 1

Related Questions