Reputation: 151
I need to divide the data in DF1 into groups based on class. In some cases everything in a class goes into the same group. Each class must be divided into groups at random, but not in equal shares. DF2 holds the shares that determine how the data is divided. I import DF2 from Excel and maintain the file myself, so its structure can be changed if needed. The Share column tells me what fraction of a class must go into each group. For example, 50% of the rows in DF1 with class 1 must go to Apples, 25% to Hammer and 25% to Car. NB! The assignment needs to be random; it can't be that the first 50% of rows are Apples, the next 25% Hammer, and so on.
My approach is to give every row in DF1 a random number, which I save every time I generate it so that I can go back and reuse it. NB! It's important that I can return to the previous random numbers if a colleague or I run the code by mistake and generate a new random seed. I have that part covered for the random number.
DF1 (base data)
ID  Class  Random
1   1      0,65
2   1      0,23
3   2      0,45
4   1      0,11
5   2      0,89
6   3      0,12
7   1      0,9
My idea is to add a Share_2 column that splits the interval 0-1 into ranges based on the Share column. In Excel logic I would like to do the following:
IF Class = 1 then
    IF Random < 0,5; Apples; if not then
    IF Random < 0,75; Hammer; if not then
    IF Random < 1; Car
DF2 (Classification file made by me)
Class  Group     Share   Share_2
1      Apples    50%*    0,5
1      Hammer    25%     0,75
1      Car       25%     1
2      Building  100%**  1
3      Computer  50%     0,5
3      Hammer    50%     1
*This means that 50% of class 1 needs to be "Apples". The shares within a class add up to 100%.
I need
DF3
ID  Class  Random  Group
1   1      0,65    Hammer
2   1      0,23    Apples
3   2      0,45    Building
4   1      0,11    Apples
5   2      0,89    Building
6   3      0,12    Computer
7   1      0,9     Car
My problem is that I don't know how to write this in R. Can you please help me? NB! Please feel free to suggest other methods as well, as long as the classes are divided at random and I can save the randomness to replicate it.
Upvotes: 1
Views: 406
Reputation: 1387
One way to go about this that does not use the random numbers you have already generated, but is otherwise fairly short, is to use the sample() function to do the random assignment directly for you:
DF1 <- data.frame(
  ID = 1:7,
  Class = c(1, 1, 2, 1, 2, 3, 1),
  Random = c(0.65, 0.23, 0.45, 0.11, 0.89, 0.12, 0.9)
)
DF1 <- DF1[order(DF1$Class), ] # EDIT: need this for the code to behave properly!
DF2 <- data.frame(
  Class = c(1, 1, 1, 2, 3, 3),
  Group = c("Apples", "Hammer", "Car", "Building", "Computer", "Hammer"),
  Share = c(0.5, 0.25, 0.25, 1, 0.5, 0.5),
  Share_2 = c(0.5, 0.75, 1, 1, 0.5, 1)
)
set.seed(12345) # this is for reproducibility; you can choose any number here
DF3 <- DF1
DF3$Group <- unlist(sapply(unique(DF1$Class), function(x) {
  with(DF2[DF2$Class == x, ],
       sample(Group, size = sum(DF3$Class == x),
              prob = Share, replace = TRUE))
}))
Working from the outside in: the sapply() call serves essentially the role of a for loop. It begins by looking at all the unique entries in DF1$Class. For each of those (called x), it carves out the chunk of DF2 whose Class is equal to x, and then works only within that chunk of DF2; this is what the with() function is doing here.
The core idea is to use sample(). We take the values to sample from the Group column of DF2, draw an appropriate number of samples (given by the size parameter), set the probabilities according to the Share column of DF2, and draw with replacement. All of this makes sense because we are inside the with() function; we have already restricted our attention not only to DF2, but to just the chunk of DF2 corresponding to Class == x.
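To make the behaviour of sample() with prob concrete, here is a minimal sketch for a single class. The names groups, shares and draw are made up for illustration, and the size of 1000 is arbitrary. One caveat worth knowing: with prob and replace = TRUE the shares hold only on average, so a particular draw will usually be close to 50/25/25 but not exactly on it.

```r
# Illustrative single-class example; these names are not part of the answer's code.
groups <- c("Apples", "Hammer", "Car")
shares <- c(0.5, 0.25, 0.25)

set.seed(12345)  # any fixed seed makes the draw repeatable
draw <- sample(groups, size = 1000, prob = shares, replace = TRUE)

# Observed proportions: near the target shares, but generally not exact.
print(round(prop.table(table(draw)), 2))

# Re-running with the same seed reproduces the same assignment.
set.seed(12345)
draw_again <- sample(groups, size = 1000, prob = shares, replace = TRUE)
print(identical(draw, draw_again))
```

If the split has to be exact rather than approximate, a different technique is needed, for example assigning by thresholds on a stored random number as described in the question.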
The unlist() function is used because the output of the sapply() function is a list in this case, and we want it just to be a vector; then, we just glue that vector directly onto the DF3 data frame, which is otherwise an identical copy of DF1.
EDIT: I added a line sorting DF1, which is necessary for this solution.
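The sort is needed because the sampled groups come back class by class and are glued on in that order. As a possible variation (a sketch, not the code above), the draws can be written back through a logical row index, which keeps DF1 in its original order. The loop variable cl and the helpers rows and opts are names introduced here for illustration.

```r
DF1 <- data.frame(
  ID = 1:7,
  Class = c(1, 1, 2, 1, 2, 3, 1)
)
DF2 <- data.frame(
  Class = c(1, 1, 1, 2, 3, 3),
  Group = c("Apples", "Hammer", "Car", "Building", "Computer", "Hammer"),
  Share = c(0.5, 0.25, 0.25, 1, 0.5, 0.5),
  stringsAsFactors = FALSE
)

set.seed(12345)  # fixed seed so the assignment can be reproduced
DF3 <- DF1
DF3$Group <- NA_character_

for (cl in unique(DF3$Class)) {
  rows <- DF3$Class == cl         # which rows of DF3 belong to this class
  opts <- DF2[DF2$Class == cl, ]  # groups and shares defined for this class
  # Draw one group per row and write it back to exactly those rows.
  DF3$Group[rows] <- sample(opts$Group, size = sum(rows),
                            prob = opts$Share, replace = TRUE)
}

print(DF3)
```

Because each draw is placed by index, the result does not depend on how DF1 happens to be ordered.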
Upvotes: 1
Reputation: 16988
Actually I don't like this solution, since I pipe two filter() calls and don't know how to do it in one statement.
Using dplyr and @Aaron Montgomery's data:
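library(dplyr)

merge(DF1, DF2, by = "Class") %>%
  group_by(Class, ID) %>%
  filter(Random <= Share_2) %>%            # keep thresholds at or above the random number
  filter(Share_2 == min(Share_2)) %>%      # of those, keep the lowest threshold
  select(-c(Share, Share_2)) %>%
  arrange(ID)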
gives
# A tibble: 7 x 4
# Groups:   Class, ID [7]
  Class    ID Random Group
  <dbl> <int>  <dbl> <chr>
1     1     1   0.65 Hammer
2     1     2   0.23 Apples
3     2     3   0.45 Building
4     1     4   0.11 Apples
5     2     5   0.89 Building
6     3     6   0.12 Computer
7     1     7   0.9  Car
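The same threshold lookup can also be sketched in base R as a single step per row, using findInterval() in place of the two filters. This is an alternative technique, not a rewrite of the pipeline above. It assumes Share_2 is the running total of Share within each class (computed here with ave() and cumsum()), and the helper assign_group is a name invented for this example. Note that it follows the strict "Random < threshold" rule from the question, so a value exactly on a boundary falls into the next group, unlike the <= used in the pipeline.

```r
DF1 <- data.frame(
  ID = 1:7,
  Class = c(1, 1, 2, 1, 2, 3, 1),
  Random = c(0.65, 0.23, 0.45, 0.11, 0.89, 0.12, 0.9)
)
DF2 <- data.frame(
  Class = c(1, 1, 1, 2, 3, 3),
  Group = c("Apples", "Hammer", "Car", "Building", "Computer", "Hammer"),
  Share = c(0.5, 0.25, 0.25, 1, 0.5, 0.5),
  stringsAsFactors = FALSE
)

# Cumulative upper bounds within each class (the Share_2 column).
DF2$Share_2 <- ave(DF2$Share, DF2$Class, FUN = cumsum)

# Hypothetical helper: pick the group whose range contains r.
assign_group <- function(cl, r) {
  opts <- DF2[DF2$Class == cl, ]
  # findInterval() counts how many thresholds are <= r; the next one is the match.
  # A value of exactly 1 would give NA; runif() does not return 1.
  opts$Group[findInterval(r, opts$Share_2) + 1]
}

DF3 <- DF1
DF3$Group <- mapply(assign_group, DF1$Class, DF1$Random)
print(DF3)
```

Since the groups depend only on the stored Random column, re-running this reproduces the same assignment without relying on a seed.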
Upvotes: 0