krish
krish

Reputation: 1438

Randomly assign value following specific distribution within group in R

I have a dataset like the one below:

spend <- structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 
29, 30), Dept = c("IT", "HR", "Marketing", "HR", "IT", "IT", 
"Marketing", "IT", "Marketing", "Marketing", "IT", "IT", "HR", 
"IT", "Marketing", "Marketing", "Marketing", "HR", "HR", "IT", 
"IT", "Marketing", "IT", "Marketing", "Marketing", "IT", "HR", 
"IT", "Marketing", "IT")), .Names = c("ID", "Dept"), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -30L))

I have a list like this one:

rating <- c("Outstanding", "Exceeds Expectation", "Achieves Expectations", "Needs Improvement")

I want to add a new column to the dataset where I randomly assign one of the values from rating based on a distribution. I want 5% values to be Outstanding, 25% as Exceeds Expectation, 67% as Achieves Expectations and 3% as Needs Improvement but within each group under the Dept. So each Dept will be randomly assigned these values but with the specific distribution.

I am unable to get the specific distribution and grouping using the sample function.

spend$Rating <- sample(rating, nrow(spend), replace = TRUE)

head(spend, 10)
# A tibble: 10 x 3
      ID      Dept                Rating
   <dbl>     <chr>                 <chr>
 1     1        IT     Needs Improvement
 2     2        HR Achieves Expectations
 3     3 Marketing   Exceeds Expectation
 4     4        HR     Needs Improvement
 5     5        IT     Needs Improvement
 6     6        IT              Rockstar
 7     7 Marketing Achieves Expectations
 8     8        IT     Needs Improvement
 9     9 Marketing     Needs Improvement
10    10 Marketing   Exceeds Expectation

This obviously does not maintain the distribution within the groups. Any inputs on this?

Upvotes: 0

Views: 1360

Answers (2)

duckmayr
duckmayr

Reputation: 16940

thelatemail was right to point out sample()'s prob argument, and Patricio Moracho's solution uses that to give you the overall distribution you want, but you need to have this distribution within each group, so here's a dplyr solution for that (since you already have it in a tibble):

library(dplyr)
spend <- structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
                               15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
                               27, 28,  29, 30),
                        Dept = c("IT", "HR", "Marketing", "HR", "IT", "IT",
                                 "Marketing", "IT", "Marketing", "Marketing",
                                 "IT", "IT", "HR", "IT", "Marketing",
                                 "Marketing", "Marketing", "HR", "HR", "IT",
                                 "IT", "Marketing", "IT", "Marketing",
                                 "Marketing", "IT", "HR", "IT", "Marketing",
                                 "IT")), .Names = c("ID", "Dept"),
                   class = c("tbl_df",  "tbl", "data.frame"),
                   row.names = c(NA, -30L))
rating <- c("Outstanding", "Exceeds Expectation",
            "Achieves Expectations", "Needs Improvement")
probs <- c(0.05, 0.25, 0.67, 0.03)

set.seed(123)
spend %>%
    group_by(Dept) %>% # note the group_by()
    mutate(Rating=sample(rating, size=n(), prob=probs, replace=TRUE)) %>%
    arrange(Dept) %>%
    print(n=nrow(.))

# A tibble: 30 x 3
# Groups:   Dept [3]
      ID      Dept                Rating
   <dbl>     <chr>                 <chr>
 1     2        HR Achieves Expectations
 2     4        HR   Exceeds Expectation
 3    13        HR Achieves Expectations
 4    18        HR   Exceeds Expectation
 5    19        HR           Outstanding
 6    27        HR Achieves Expectations
 7     1        IT Achieves Expectations
 8     5        IT   Exceeds Expectation
 9     6        IT Achieves Expectations
10     8        IT Achieves Expectations
11    11        IT           Outstanding
12    12        IT Achieves Expectations
13    14        IT   Exceeds Expectation
14    20        IT Achieves Expectations
15    21        IT Achieves Expectations
16    23        IT   Exceeds Expectation
17    26        IT Achieves Expectations
18    28        IT Achieves Expectations
19    30        IT Achieves Expectations
20     3 Marketing           Outstanding
21     7 Marketing   Exceeds Expectation
22     9 Marketing   Exceeds Expectation
23    10 Marketing Achieves Expectations
24    15 Marketing     Needs Improvement
25    16 Marketing Achieves Expectations
26    17 Marketing   Exceeds Expectation
27    22 Marketing Achieves Expectations
28    24 Marketing Achieves Expectations
29    25 Marketing Achieves Expectations
30    29 Marketing Achieves Expectations

EDIT:

I realize you may also want to double check that the frequencies shake out in a larger sample with this approach within groups:

spend <- do.call("rbind", replicate(100, spend, simplify = FALSE))

set.seed(123)
spend <- spend %>%
    group_by(Dept) %>% # note the group_by()
    mutate(Rating=sample(rating, size=n(), prob=probs, replace=TRUE))

spend %>%
    group_by(Dept, Rating) %>%
    summarise(n=n()) %>%
    mutate(freq=n/sum(n))

# A tibble: 12 x 4
# Groups:   Dept [3]
        Dept                Rating     n       freq
       <chr>                 <chr> <int>      <dbl>
 1        HR Achieves Expectations   405 0.67500000
 2        HR   Exceeds Expectation   148 0.24666667
 3        HR     Needs Improvement    21 0.03500000
 4        HR           Outstanding    26 0.04333333
 5        IT Achieves Expectations   878 0.67538462
 6        IT   Exceeds Expectation   325 0.25000000
 7        IT     Needs Improvement    37 0.02846154
 8        IT           Outstanding    60 0.04615385
 9 Marketing Achieves Expectations   738 0.67090909
10 Marketing   Exceeds Expectation   281 0.25545455
11 Marketing     Needs Improvement    29 0.02636364
12 Marketing           Outstanding    52 0.04727273

Upvotes: 2

Patricio Moracho
Patricio Moracho

Reputation: 717

As thelatemail mentioned, you can use sample() indicating the probability you want, as follows:

spend$Rating <- sample(rating, nrow(spend), replace = TRUE, prob=c(5,25,67,3))

Testing:

# Just for testing, a lot of more items to verify sample
spend <- do.call("rbind", replicate(100, spend, simplify = FALSE))

set.seed(100)
spend$Rating <- sample(rating, nrow(spend), replace = TRUE, prob=c(5,25,67,3))
aggregate(spend$Rating, by=list(spend$Rating), function(x) length(x)/nrow(spend)*100)

                Group.1         x
1 Achieves Expectations 67.000667
2   Exceeds Expectation 24.996333
3     Needs Improvement  2.995333
4           Outstanding  5.007667

Upvotes: 0

Related Questions