Here is the dummy data set:
df <- data.frame(matrix(rnorm(80), nrow=40))
df$color <- rep(c("blue", "red", "yellow", "pink"), each=10)
1 -1.22049503 blue
2 1.61641224 blue
3 0.09079087 blue
4 0.32325956 blue
5 -0.62733486 red
6 0.43102051 red
7 0.61619844 red
8 -0.17718356 red
9 1.18737562 yellow
10 -0.19035444 yellow
11 -0.49158052 yellow
12 -1.47425432 yellow
13 0.22942192 pink
14 0.76779548 pink
15 0.97631652 pink
16 -0.33513712 pink
What I am trying to get is this: if df$color is blue, those rows are always selected; if df$color is red, the row has a higher probability of being selected; if df$color is yellow, the row has a lower probability of being selected; and if df$color is pink, the row has a very low probability of being selected.
This is what I came up with:
my.data.frame <- df[(df$color == 'pink') | (df$color == 'blue') & runif(1) < .6 | (df$color == 'red') & runif(1) < .6|(df$color == 'yellow') & runif(1) < .3, ]
But here is the output from two runs. First run:
1 -1.22049503 blue
2 1.61641224 blue
3 0.09079087 blue
4 0.32325956 blue
13 0.22942192 pink
14 0.76779548 pink
15 0.97631652 pink
16 -0.33513712 pink
Second run:
1 -1.22049503 blue
2 1.61641224 blue
3 0.09079087 blue
4 0.32325956 blue
5 -0.62733486 red
6 0.43102051 red
7 0.61619844 red
8 -0.17718356 red
13 0.22942192 pink
14 0.76779548 pink
15 0.97631652 pink
16 -0.33513712 pink
So the blue rows are always selected, as expected, but the other colors are selected all-or-nothing: in the first run all the pink rows are selected, and in the second run all the pink and all the red rows are selected, instead of some of the red rows and even fewer of the pink ones.
What am I missing? Or is there a better way of doing this?
Upvotes: 0
Views: 331
Reputation: 13135
Using a tidyverse approach:
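library(purrr)
library(dplyr)
library(tidyr)
Sample_df <- df %>%
  group_by(color) %>%
  nest() %>%
  ungroup() %>%  # drop the grouping so Prob spans all four rows
  # one sampling fraction per color, in order of appearance
  mutate(Prob = c(.8, .6, .4, .2), samp = map2(data, Prob, sample_frac)) %>%
  select(color, samp) %>%
  unnest(samp)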
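Note that sample_frac keeps a fixed fraction of each group rather than flipping a coin per row. A minimal base R sketch of roughly the same idea, assuming the fractions map to colors in order of appearance (the names `frac`, `idx_by_color` and `sampled` are only illustrative, and the seed is arbitrary):

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
df <- data.frame(matrix(rnorm(80), nrow = 40))
df$color <- rep(c("blue", "red", "yellow", "pink"), each = 10)

# Fraction of rows to keep per color (same values as Prob above)
frac <- c(blue = 0.8, red = 0.6, yellow = 0.4, pink = 0.2)

# Split the row indices by color, then draw a fixed number of rows per group
idx_by_color <- split(seq_len(nrow(df)), df$color)
keep <- unlist(lapply(names(idx_by_color), function(col) {
  idx <- idx_by_color[[col]]
  n_keep <- round(frac[[col]] * length(idx))
  idx[sample.int(length(idx), size = n_keep)]
}))

sampled <- df[sort(keep), ]
table(sampled$color)  # group sizes are fixed by frac, not random
```

With 10 rows per color this always keeps 8 blue, 6 red, 4 yellow and 2 pink rows; only which rows are kept changes between runs. If each row should instead be kept independently with some probability, a per-row random draw is needed.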
Upvotes: 1
Reputation: 76651
I believe the mistake is calling runif(1): it produces just one random number per color condition, and that single value is recycled across every row, so each color is selected all-or-nothing. In what follows I do it one step at a time to make the code clearer.
I also set the RNG seed first.
set.seed(4287) # make the results reproducible
df <- data.frame(matrix(rnorm(80), nrow=40))
df$color <- rep(c("blue", "red", "yellow", "pink"), each=10)
Now for the selection.
n <- nrow(df)
blue <- df$color == 'blue'
red <- df$color == 'red'
yellow <- df$color == 'yellow'
pink <- df$color == 'pink'
inx1 <- (blue | red) & runif(n) < 0.6
inx2 <- yellow & runif(n) < 0.3
inx3 <- pink & runif(n) < 0.1
df[inx1 | inx2 | inx3, ]
# X1 X2 color
#1 -0.85857648 1.0293620 blue
#4 -0.57829575 0.8344532 blue
#5 -0.48677993 1.2926264 blue
#6 -1.43502687 -0.1426327 blue
#7 1.30722272 0.4138376 blue
#8 -1.31555715 0.9674004 blue
#9 -2.00829490 -0.4191471 blue
#12 -0.04129173 -0.3498928 red
#14 0.44029645 -1.2079088 red
#16 -1.45220640 1.9970560 red
#18 -0.63078352 -0.0219340 red
#19 -0.34640599 -0.6622532 red
#20 0.48505620 0.4545426 red
#22 -1.54078662 -0.4094573 yellow
#26 -0.92234468 1.7194836 yellow
#29 -0.19507474 -0.1937266 yellow
#34 0.19274923 0.7879300 pink
#35 0.43921280 -0.9091608 pink
#37 -0.20192350 -0.5766637 pink
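The same per-row idea can be sketched more compactly with a named vector of probabilities indexed by color. The values in `p` are assumptions read off the question's wording (blue always kept, pink rarely), not numbers taken from the code above, and `p`, `keep` and `selected` are illustrative names:

```r
set.seed(4287)  # make the results reproducible
df <- data.frame(matrix(rnorm(80), nrow = 40))
df$color <- rep(c("blue", "red", "yellow", "pink"), each = 10)

# Assumed selection probabilities (hypothetical values); 1 keeps every blue row
p <- c(blue = 1, red = 0.6, yellow = 0.3, pink = 0.1)

# One uniform draw per row, compared with that row's own probability
keep <- runif(nrow(df)) < unname(p[df$color])
selected <- df[keep, ]

# For contrast: runif(1) yields a single value, so the comparison is one
# TRUE/FALSE that gets recycled over every row it is combined with
one_draw <- runif(1) < 0.6
length(one_draw)
```

Adding a color or changing a probability then only means editing `p`, with no new logical index per color.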
Upvotes: 0