Random row selection based on a column value and probability

Question

Here is the dummy data set:

df <- data.frame(matrix(rnorm(80), nrow=40))
df$color <-  rep(c("blue", "red", "yellow", "pink"), each=10)


1                   -1.22049503   blue
2                    1.61641224   blue
3                    0.09079087   blue
4                    0.32325956   blue
5                   -0.62733486    red
6                    0.43102051    red
7                    0.61619844    red
8                   -0.17718356    red
9                    1.18737562 yellow
10                  -0.19035444 yellow
11                  -0.49158052 yellow
12                  -1.47425432 yellow
13                   0.22942192   pink
14                   0.76779548   pink
15                   0.97631652   pink
16                  -0.33513712   pink

What I am trying to get is this: if df$color is blue, those rows should always be selected; if df$color is red, the row should have a higher probability of being selected; if df$color is yellow, a lower probability; and if df$color is pink, a very low probability.

This is what I came up with

my.data.frame <- df[(df$color == 'pink') | (df$color == 'blue') & runif(1) < .6 | (df$color == 'red') & runif(1) < .6|(df$color == 'yellow') & runif(1) < .3, ]

But here is the output of two runs. In the first run:

1                   -1.22049503  blue
2                    1.61641224  blue
3                    0.09079087  blue
4                    0.32325956  blue
13                   0.22942192  pink
14                   0.76779548  pink
15                   0.97631652  pink
16                  -0.33513712  pink

In the second run:

1                   -1.22049503  blue
2                    1.61641224  blue
3                    0.09079087  blue
4                    0.32325956  blue
5                   -0.62733486   red
6                    0.43102051   red
7                    0.61619844   red
8                   -0.17718356   red
13                   0.22942192  pink
14                   0.76779548  pink
15                   0.97631652  pink
16                  -0.33513712  pink

So the blue rows are always selected, as expected. But the other colors are selected all or nothing: in the first run all the pink rows are selected, and in the second run all the pink and all the red rows are selected, instead of some of the red rows and even fewer of the pink ones.

What am I missing? Is there a better way of doing this?

Rui Barradas · Accepted Answer

I believe the mistake is that you produce just one runif per color group. runif(1) returns a single number, which is recycled across every row of that color, so each group is kept or dropped as a whole. In what follows I do it one step at a time to make the code clearer.
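To see why a single draw matters, here is a minimal sketch. The vector `color` is a made-up stand-in for `df$color`, and the seed is arbitrary; neither comes from the question. It compares one recycled draw with one draw per row.

```r
set.seed(1)  # arbitrary seed, only so the sketch is repeatable

# Hypothetical stand-in for df$color
color <- rep(c("blue", "red"), each = 5)
n <- length(color)

# One draw: `runif(1) < 0.6` is a single TRUE/FALSE that R recycles
# across all rows, so the red group is selected all or nothing
one_draw <- (color == "red") & runif(1) < 0.6

# n draws: every row is compared against its own random number
per_row <- (color == "red") & runif(n) < 0.6

sum(one_draw)  # either 0 or 5, never in between
sum(per_row)   # anywhere from 0 to 5
```

With `runif(n)` each red row is an independent trial, which is what "some of the red rows" requires.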

And I set the RNG seed first.

set.seed(4287)    # make the results reproducible

df <- data.frame(matrix(rnorm(80), nrow=40))
df$color <-  rep(c("blue", "red", "yellow", "pink"), each=10)

Now for the selection.

n <- nrow(df)

blue <- df$color == 'blue'
red <- df$color == 'red'
yellow <- df$color == 'yellow'
pink <- df$color == 'pink'

inx1 <- (blue | red) & runif(n) < 0.6
inx2 <- yellow & runif(n) < 0.3
inx3 <- pink & runif(n) < 0.1

df[inx1 | inx2 | inx3, ]
#            X1         X2  color
#1  -0.85857648  1.0293620   blue
#4  -0.57829575  0.8344532   blue
#5  -0.48677993  1.2926264   blue
#6  -1.43502687 -0.1426327   blue
#7   1.30722272  0.4138376   blue
#8  -1.31555715  0.9674004   blue
#9  -2.00829490 -0.4191471   blue
#12 -0.04129173 -0.3498928    red
#14  0.44029645 -1.2079088    red
#16 -1.45220640  1.9970560    red
#18 -0.63078352 -0.0219340    red
#19 -0.34640599 -0.6622532    red
#20  0.48505620  0.4545426    red
#22 -1.54078662 -0.4094573 yellow
#26 -0.92234468  1.7194836 yellow
#29 -0.19507474 -0.1937266 yellow
#34  0.19274923  0.7879300   pink
#35  0.43921280 -0.9091608   pink
#37 -0.20192350 -0.5766637   pink
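The same selection could also be written more compactly with a named vector of probabilities. This is only a sketch of an alternative, not part of the code above: the names `p`, `keep` and `selected` are my own, and the probabilities are simply the ones used in this answer.

```r
set.seed(4287)  # same seed as above, for reproducibility

df <- data.frame(matrix(rnorm(80), nrow = 40))
df$color <- rep(c("blue", "red", "yellow", "pink"), each = 10)

# Per-color selection probabilities (same values as in the code above)
p <- c(blue = 0.6, red = 0.6, yellow = 0.3, pink = 0.1)

# Look up each row's probability by its color name,
# then compare it against one uniform draw per row
keep <- runif(nrow(df)) < unname(p[df$color])

selected <- df[keep, ]
table(selected$color)  # number of rows kept per color
```

The result need not match the output above row for row, since the random numbers are drawn in a different order. If the blue rows should always be selected, as the question asks, setting `blue = 1` in `p` should do it, because `runif` does not return the extreme value 1 under its defaults.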
