In R how to match with multiple conditions?

Question

I need to divide the data in DF1 into groups based on their class. In some cases everything in a class will be in the same group. Each class needs to be divided into groups at random, but not in equal shares. DF2 holds the shares that specify how the data needs to be divided. I import DF2 from Excel; I maintain this file, so the structure of the data can be changed if needed. This is the file I will use to divide the classes into groups. The Share column tells me what fraction of the class must go into each group. For example, 50% of the rows in DF1 with class 1 must be assigned to Apples, 25% to Hammer and 25% to Car. NB! It needs to be random; it can't be that the first 50% of rows are Apples, the next 25% Hammer, etc.

My solution is to give every row in DF1 a random number, which I save every time I generate it so I can go back and reuse the values I got before. NB! It’s important to me that I can go back to the previous random numbers if a colleague or I run the code by mistake and generate a new random seed. I already have that part covered for the random number.

      DF1 (base data)          
ID   Class   Random     
1      1      0,65
2      1      0,23
3      2      0,45
4      1      0,11
5      2      0,89
6      3      0,12
7      1      0,9

My solution is to make a Share_2 column where I divide the interval 0-1 into ranges based on the Share column. In Excel logic I would like to do the following:

IF Class = 1 then
IF Random < 0,5; Apples; if not then
IF Random < 0,75; Hammer if not then
IF Random <1; Car

 DF2  (Classification file made by me)
Class   Group          Share      Share_2
1       Apples        50%*        0,5
1       Hammer        25%         0,75
1       Car           25%         1
2       Building      100%**      1
3       Computer      50%         0,5
3       Hammer        50%         1

*This means that 50% of class 1 needs to be "Apples". The shares within a class add up to 100%.

I need

    DF3
ID   Class   Random      Group    
1      1      0,65      Hammer
2      1      0,23      Apples
3      2      0,45      Building
4      1      0,11      Apples
5      2      0,89      Building
6      3      0,12      Computer
7      1      0,9       Car

My problem is that I don’t know how to write it in R. Can you please help me? NB! Please feel free to also offer other methods of solving my problem, as long as they divide each class at random and I can save the randomness to replicate it.

Aaron Montgomery · Accepted Answer

One way to go about this, which does not use the random numbers you have already generated but is otherwise fairly short, is to use the sample() function to do the random assignment directly:

DF1 <- data.frame(
  ID = 1:7,
  Class = c(1, 1, 2, 1, 2, 3, 1),
  Random = c(0.65, 0.23, 0.45, 0.11, 0.89, 0.12, 0.9)
)

DF1 <- DF1[order(DF1$Class), ]  #EDIT: need this for the code to behave properly!

DF2 <- data.frame(
  Class = c(1, 1, 1, 2, 3, 3),
  Group = c("Apples", "Hammer", "Car", "Building", "Computer", "Hammer"),
  Share = c(0.5, 0.25, 0.25, 1, 0.5, 0.5),
  Share_2 = c(0.5, 0.75, 1, 1, 0.5, 1)
)

set.seed(12345)  # this is for reproducibility; you can choose any number here

DF3 <- DF1

DF3$Group <- unlist(sapply(unique(DF1$Class), function(x) {
  with(DF2[DF2$Class == x, ], 
       sample(Group, size = sum(DF3$Class == x), 
              prob = Share, replace = TRUE))
}))

Working from the outside in: the sapply() call essentially plays the role of a for loop. It begins by looking at all the unique entries in DF1$Class. For each of those (called x), it carves out the chunk of DF2 that has Class equal to x, and then focuses only on that chunk -- this is what the with() function is doing here.

The core idea is to use sample(). We draw the things to sample from the Group column of DF2, draw an appropriate number of samples (marked by the size parameter), set the probabilities according to the Share column of DF2, and draw with replacement. All of this makes sense because we are inside the with() function; we have already restricted our attention to not only DF2, but just the chunk of DF2 corresponding to Class == x.
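
To see the sample() step in isolation, here is a minimal sketch for a single class. The value n_rows is a hypothetical stand-in for sum(DF3$Class == x), and the trimmed DF2 only carries the columns this step needs:

```r
# Trimmed copy of DF2 with only the columns used here
DF2 <- data.frame(
  Class = c(1, 1, 1, 2, 3, 3),
  Group = c("Apples", "Hammer", "Car", "Building", "Computer", "Hammer"),
  Share = c(0.5, 0.25, 0.25, 1, 0.5, 0.5)
)

set.seed(12345)  # any fixed number makes the draw repeatable

x <- 1       # the class currently being processed
n_rows <- 4  # hypothetical: number of DF1 rows that have Class == x

# Inside with(), Group and Share refer to the columns of the class-1 chunk only
draws <- with(DF2[DF2$Class == x, ],
              sample(Group, size = n_rows, prob = Share, replace = TRUE))
draws
```

Note that Share acts as a probability for each individual draw, so the realized proportions in a class will only approximate 50/25/25 rather than hit them exactly, especially when a class has few rows.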

The unlist() function is used because the output of the sapply() function is a list in this case, and we want it just to be a vector; then, we just glue that vector directly onto the DF3 data frame, which is otherwise an identical copy of DF1.
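
A small sketch of what sapply() and unlist() do here, using placeholder labels instead of sampled groups so the result is deterministic. The names classes, per_class and flat are made up for this illustration:

```r
# Class column as it looks after DF1 has been sorted by Class
classes <- c(1, 1, 1, 1, 2, 2, 3)

# One vector per class; the lengths differ (4, 2, 1), so sapply() cannot
# simplify the result and returns a list
per_class <- sapply(unique(classes), function(x) {
  rep(paste0("class", x), sum(classes == x))
})
is.list(per_class)  # TRUE

# unlist() concatenates the pieces class by class into one vector
flat <- unlist(per_class)
flat
# "class1" "class1" "class1" "class1" "class2" "class2" "class3"
```

Because the flattened vector is ordered class by class, it only lines up with the right rows if the data frame it is attached to is sorted by Class as well.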

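EDIT: I added a line sorting DF1, which is necessary for this solution: the sampled groups come back class by class, so the rows of DF3 must be in the same order when the vector is attached.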
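
If you would rather reuse the Random column you have already saved, the Excel-style logic from the question can be sketched as below. This is not the approach above, just one possible way to do it: assign_group is a hypothetical helper name, and the sketch assumes the rows of DF2 within each class are ordered by increasing Share_2 and that Random is always below 1.

```r
DF1 <- data.frame(
  ID = 1:7,
  Class = c(1, 1, 2, 1, 2, 3, 1),
  Random = c(0.65, 0.23, 0.45, 0.11, 0.89, 0.12, 0.9)
)

DF2 <- data.frame(
  Class = c(1, 1, 1, 2, 3, 3),
  Group = c("Apples", "Hammer", "Car", "Building", "Computer", "Hammer"),
  Share_2 = c(0.5, 0.75, 1, 1, 0.5, 1)  # cumulative share within each class
)

# Hypothetical helper: return the first group of the class whose
# cumulative share is larger than the stored random number
assign_group <- function(class, random) {
  rows <- DF2[DF2$Class == class, ]
  as.character(rows$Group)[which(random < rows$Share_2)[1]]
}

DF3 <- DF1
DF3$Group <- mapply(assign_group, DF1$Class, DF1$Random)
DF3$Group
# "Hammer" "Apples" "Building" "Apples" "Building" "Computer" "Car"
```

This version does not need DF1 to be sorted, since each row is matched on its own, and the result is reproducible as long as the Random column itself is saved.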
