Random sample by group and filtering on the basis of result

Question

I have a dataframe that is generated by the following code

l_ids = c(1, 1, 1, 2, 2, 2, 2)
l_months = c(5, 5, 5, 88, 88, 88, 88)
l_calWeek = c(201708, 201709, 201710, 201741, 201742, 201743, 201744)
value = c(5, 6, 3, 99, 100, 1001, 1002)

dat <- setNames(data.frame(cbind(l_ids, l_months, l_calWeek, value)), 
c("ids", "months", "calWeek", "value"))

and looks like this:

+----+-------+----------+-------+
| Id | Month | Cal Week | Value |
+----+-------+----------+-------+
|  1 |     5 |   201708 |   4.5 |
|  1 |     5 |   201709 |     5 |
| 1  |     5 |   201710 |     6 |
|  2 |    88 |   201741 |    75 |
|  2 |    88 | 201742   |    89 |
| 2  |    88 | 201743   |    90 |
|  2 |    88 |   201744 |    51 |
+----+-------+----------+-------+

I would like to randomly sample a calendar week from each id-month group (the months are not calendar months). Then I would like to keep all id-month combination prior to the sample months.

An example output could be: suppose the sampling output returned cal week 201743 for the group id=2 and month=88 and 201709 for the group id=1 and month=5, then the final ouput should be

+----+-------+----------+-------+
| Id | Month | Cal Week | Value |
+----+-------+----------+-------+
|  1 |     5 |   201708 |   4.5 |
|  1 |     5 |   201709 |     5 |
|  2 |    88 |   201741 |    75 |
|  2 |    88 | 201742   |    89 |
   2  |    88 | 201743   |    90 |

+----+-------+----------+-------+

I tried to work with dplyr's sample_n function (which is going to give me the random calendar week by id-month group, but then I do not know how to get all calendar weeks prior to that date. Can you help me with this. If possible, I would like to work with dplyr.

Please let me know in case you need further information.

Many thanks

tjebo · Accepted Answer

require(dplyr) 
set.seed(1)     # when sampling please set.seed
sampled <- dat %>% group_by(ids) %>% do(., sample_n(.,1)) 

sampled_day <- sampled$calWeek

dat %>% group_by(ids) %>% 
  mutate(max_day = which(calWeek %in% sampled_day)) %>%
  filter(row_number() <= max_day)

#You can also just filter directly with row_number() <= which(calWeek %in% sampled_day)

# A tibble: 3 x 4
# Groups:   ids [2]
    ids months calWeek  value
         
1  1.00   5.00  201708   5.00
2  2.00  88.0   201741  99.0 
3  2.00  88.0   201742 100

This depends on the row order! So make sure to arrange by day first. You'll need to think about ties, though. I have edited my previous answer and simply filtered with <=

Random sample by group and filtering on the basis of result

Answers (2)

Related Questions