Reputation: 1438
I have a dataset like the one below:
spend <- structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
29, 30), Dept = c("IT", "HR", "Marketing", "HR", "IT", "IT",
"Marketing", "IT", "Marketing", "Marketing", "IT", "IT", "HR",
"IT", "Marketing", "Marketing", "Marketing", "HR", "HR", "IT",
"IT", "Marketing", "IT", "Marketing", "Marketing", "IT", "HR",
"IT", "Marketing", "IT")), .Names = c("ID", "Dept"), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -30L))
I have a list like this one:
rating <- c("Outstanding", "Exceeds Expectation", "Achieves Expectations", "Needs Improvement")
I want to add a new column to the dataset where I randomly assign one of the values from rating
based on a distribution. I want 5% values to be Outstanding
, 25% as Exceeds Expectation
, 67% as Achieves Expectations
and 3% as Needs Improvement
but within each group under the Dept
. So each Dept
will be randomly assigned these values but with the specific distribution.
I am unable to get the specific distribution and grouping using the sample
function.
spend$Rating <- sample(rating, nrow(spend), replace = TRUE)
head(spend, 10)
# A tibble: 10 x 3
ID Dept Rating
<dbl> <chr> <chr>
1 1 IT Needs Improvement
2 2 HR Achieves Expectations
3 3 Marketing Exceeds Expectation
4 4 HR Needs Improvement
5 5 IT Needs Improvement
6 6 IT Rockstar
7 7 Marketing Achieves Expectations
8 8 IT Needs Improvement
9 9 Marketing Needs Improvement
10 10 Marketing Exceeds Expectation
This obviously does not maintain the distribution within the groups. Any inputs on this?
Upvotes: 0
Views: 1360
Reputation: 16940
thelatemail was right to point out sample()
's prob
argument, and Patricio Moracho's solution uses that to give you the overall distribution you want, but you need to have this distribution within each group, so here's a dplyr
solution for that (since you already have it in a tibble):
library(dplyr)
spend <- structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
27, 28, 29, 30),
Dept = c("IT", "HR", "Marketing", "HR", "IT", "IT",
"Marketing", "IT", "Marketing", "Marketing",
"IT", "IT", "HR", "IT", "Marketing",
"Marketing", "Marketing", "HR", "HR", "IT",
"IT", "Marketing", "IT", "Marketing",
"Marketing", "IT", "HR", "IT", "Marketing",
"IT")), .Names = c("ID", "Dept"),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA, -30L))
rating <- c("Outstanding", "Exceeds Expectation",
"Achieves Expectations", "Needs Improvement")
probs <- c(0.05, 0.25, 0.67, 0.03)
set.seed(123)
spend %>%
group_by(Dept) %>% # note the group_by()
mutate(Rating=sample(rating, size=n(), prob=probs, replace=TRUE)) %>%
arrange(Dept) %>%
print(n=nrow(.))
# A tibble: 30 x 3
# Groups: Dept [3]
ID Dept Rating
<dbl> <chr> <chr>
1 2 HR Achieves Expectations
2 4 HR Exceeds Expectation
3 13 HR Achieves Expectations
4 18 HR Exceeds Expectation
5 19 HR Outstanding
6 27 HR Achieves Expectations
7 1 IT Achieves Expectations
8 5 IT Exceeds Expectation
9 6 IT Achieves Expectations
10 8 IT Achieves Expectations
11 11 IT Outstanding
12 12 IT Achieves Expectations
13 14 IT Exceeds Expectation
14 20 IT Achieves Expectations
15 21 IT Achieves Expectations
16 23 IT Exceeds Expectation
17 26 IT Achieves Expectations
18 28 IT Achieves Expectations
19 30 IT Achieves Expectations
20 3 Marketing Outstanding
21 7 Marketing Exceeds Expectation
22 9 Marketing Exceeds Expectation
23 10 Marketing Achieves Expectations
24 15 Marketing Needs Improvement
25 16 Marketing Achieves Expectations
26 17 Marketing Exceeds Expectation
27 22 Marketing Achieves Expectations
28 24 Marketing Achieves Expectations
29 25 Marketing Achieves Expectations
30 29 Marketing Achieves Expectations
EDIT:
I realize you may also want to double check that the frequencies shake out in a larger sample with this approach within groups:
spend <- do.call("rbind", replicate(100, spend, simplify = FALSE))
set.seed(123)
spend <- spend %>%
group_by(Dept) %>% # note the group_by()
mutate(Rating=sample(rating, size=n(), prob=probs, replace=TRUE))
spend %>%
group_by(Dept, Rating) %>%
summarise(n=n()) %>%
mutate(freq=n/sum(n))
# A tibble: 12 x 4
# Groups: Dept [3]
Dept Rating n freq
<chr> <chr> <int> <dbl>
1 HR Achieves Expectations 405 0.67500000
2 HR Exceeds Expectation 148 0.24666667
3 HR Needs Improvement 21 0.03500000
4 HR Outstanding 26 0.04333333
5 IT Achieves Expectations 878 0.67538462
6 IT Exceeds Expectation 325 0.25000000
7 IT Needs Improvement 37 0.02846154
8 IT Outstanding 60 0.04615385
9 Marketing Achieves Expectations 738 0.67090909
10 Marketing Exceeds Expectation 281 0.25545455
11 Marketing Needs Improvement 29 0.02636364
12 Marketing Outstanding 52 0.04727273
Upvotes: 2
Reputation: 717
As thelatemail mentioned, you can use sample()
indicating the probability you want, as follows:
spend$Rating <- sample(rating, nrow(spend), replace = TRUE, prob=c(5,25,67,3))
Testing:
# Just for testing, a lot of more items to verify sample
spend <- do.call("rbind", replicate(100, spend, simplify = FALSE))
set.seed(100)
spend$Rating <- sample(rating, nrow(spend), replace = TRUE, prob=c(5,25,67,3))
aggregate(spend$Rating, by=list(spend$Rating), function(x) length(x)/nrow(spend)*100)
Group.1 x
1 Achieves Expectations 67.000667
2 Exceeds Expectation 24.996333
3 Needs Improvement 2.995333
4 Outstanding 5.007667
Upvotes: 0