Random Sample by Group and Specific Probability Distribution for Groups

Question

I have a data set containing every date in 2021, and I would like to draw random samples of dates (with repeats) within each month. The distribution of dates within each month should follow a specified percentage for each day of the week. For example, I would like to generate 1000 dates from January 2021, and approximately 8% (about 80) of these dates should be Mondays. Please consider the following working example:

library(tidyverse)
library(lubridate)

dt2021 <- 
  tibble(SalesDate = seq.Date(
    ymd("2021-01-01"), 
    ymd("2021-12-31"), 1)) %>%
  mutate(
    wkDay=weekdays(SalesDate),
    year=year(SalesDate), 
    month=month(SalesDate))
dt2021 %>% glimpse()

dtWkDays <- tibble(
  wkDay=c("Monday", "Tuesday", "Wednesday",
          "Thursday", "Friday", "Saturday",
          "Sunday"),
  Freq=c(0.08, 0.07, 0.09, 0.12, 0.31, 0.32, 
         0.01))
dtWkDays

My pseudo script for what I am trying to do would look something like the following:

set.seed(123)
dt2021_01 <- dt2021 %>% filter(month==1) %>%  
    # generate a random sample of 1000 dates
    # use the wkday in dtWkDays for the grouping (stratification)
    # use the Freq in dtWkDays for the weights
    # resample = T

If the solution is correct, the following R script should produce around 80 Mondays, 70 Tuesdays, 90 Wednesdays, 120 Thursdays, etc.

dt2021_01 %>% count(wkDay)

I have tried several combinations using slice_sample, sample_frac, and group_by, weight_by, etc., and nothing has generated the correct results for me.

Ben · Accepted Answer

I believe this might work. Join the frequency tibble to your date tibble. After filtering to the month of interest, compute a per-date weight: the frequency for that day of the week divided by the number of times that day of the week appears in the month. The adjustment matters because some weekdays occur five times in a month and others only four, so using Freq directly as a row weight would oversample the weekdays that have more dates. Finally, use slice_sample with this new weight passed as weight_by. Here the weights add up to 1, though slice_sample would standardize them to add up to 1 anyway.
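
To make the adjustment concrete, here is a small base R sketch of the arithmetic for January 2021. It is only an illustration, not part of the pipeline below: the names jan, day_names, wk_day, freq, n_days and new_freq are made up for this example, and weekday names are derived from as.POSIXlt() rather than weekdays() so the result does not depend on the locale. Mondays occur 4 times in January 2021, so each Monday gets a weight of 0.08 / 4 = 0.02; Fridays occur 5 times, so each Friday gets 0.31 / 5 = 0.062.

```r
# Dates in January 2021, labelled with locale-independent English weekday names
jan <- seq(as.Date("2021-01-01"), as.Date("2021-01-31"), by = "day")
day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")
wk_day <- day_names[as.POSIXlt(jan)$wday + 1]  # $wday is 0 (Sunday) to 6 (Saturday)

# Target share of each weekday (values copied from dtWkDays in the question)
freq <- c(Monday = 0.08, Tuesday = 0.07, Wednesday = 0.09, Thursday = 0.12,
          Friday = 0.31, Saturday = 0.32, Sunday = 0.01)

# Number of times each weekday occurs in the month
n_days <- table(wk_day)

# Per-date weight: the weekday's share split evenly across its dates
new_freq <- freq[wk_day] / as.numeric(n_days[wk_day])

# Summing the per-date weights by weekday recovers the target shares,
# and the weights sum to 1 overall (up to floating-point error)
tapply(new_freq, wk_day, sum)
sum(new_freq)
```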

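library(tidyverse)

set.seed(123)

dt2021 %>%
  filter(month == 1) %>%
  # attach the target frequency for each day of the week
  left_join(dtWkDays, by = "wkDay") %>%
  group_by(wkDay) %>%
  # split each weekday's frequency evenly across its dates in the month
  mutate(newFreq = Freq / n()) %>%
  ungroup() %>%
  # draw 1000 dates with replacement, weighted by the per-date frequency
  slice_sample(n = 1000, weight_by = newFreq, replace = TRUE) %>%
  count(wkDay)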

Output

  wkDay         n
  <chr>     <int>
1 Friday      312
2 Monday       81
3 Saturday    320
4 Sunday       10
5 Thursday    120
6 Tuesday      62
7 Wednesday    95
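
Since the question asks for samples in each month, the same idea can probably be extended by grouping on month instead of filtering to a single one. The following is a sketch under a few assumptions rather than a tested drop-in: it rebuilds dt2021 and dtWkDays so that it runs on its own, derives weekday names from as.POSIXlt() to avoid locale differences in weekdays(), and introduces the names day_names and samples purely for illustration. On a grouped tibble, slice_sample draws n rows per group and evaluates weight_by within each group.

```r
library(dplyr)

day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")

# Target share of each weekday, as in the question
dtWkDays <- tibble(
  wkDay = c("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"),
  Freq  = c(0.08, 0.07, 0.09, 0.12, 0.31, 0.32, 0.01))

# Every date in 2021 with its weekday name and month number
dt2021 <- tibble(
  SalesDate = seq(as.Date("2021-01-01"), as.Date("2021-12-31"), by = "day")) %>%
  mutate(
    wkDay = day_names[as.POSIXlt(SalesDate)$wday + 1],
    month = as.integer(format(SalesDate, "%m")))

set.seed(123)

samples <- dt2021 %>%
  left_join(dtWkDays, by = "wkDay") %>%
  # per-date weight within each month: weekday share / occurrences that month
  group_by(month, wkDay) %>%
  mutate(newFreq = Freq / n()) %>%
  # regroup by month only, so 1000 dates are drawn for every month
  group_by(month) %>%
  slice_sample(n = 1000, weight_by = newFreq, replace = TRUE) %>%
  ungroup()

# Weekday counts per month; each month's counts should be close to 1000 * Freq
samples %>% count(month, wkDay)
```

Each month then contributes 1000 sampled dates, and the weekday counts within a month should vary around 1000 * Freq in the same way as the January output above.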
