Alex

Reputation: 245

Randomly selecting rows based on all groups in two columns

I have a large dataset with about 167k rows. I would like to take a sample of 2000 rows from it while making sure I am taking rows from all groups in two columns (id & quality) in the data. This is a snapshot of the data:

library(dplyr)  # provides %>% and glimpse()

df <- data.frame(id=c(1,2,3,4,5,1,2),
                 quality=c("a","b","c","d","z","g","t"))

df %>% glimpse()
Rows: 7
Columns: 2
$ id      <dbl> 1, 2, 3, 4, 5, 1, 2
$ quality <chr> "a", "b", "c", "d", "z", "g", "t"

So, I need to ensure that the sampled data has rows from all combinations of these two group columns. I hope someone can help out.

Thanks!

Upvotes: 0

Views: 322

Answers (2)

Serkan

Reputation: 1955

If you want to make sure that each combination of id and quality is represented in your new sample, you need to group your data by these variables.

What you are looking for is the following:

df %>% 
        group_by(id,quality) %>% 
        sample_n(1, replace = TRUE)

You can change the sample size per group, and set replacement as desired.

It gives the following output:

# Groups:   id, quality [7]
     id quality
  <dbl> <chr>  
1     1 a      
2     1 g      
3     2 b      
4     2 t      
5     3 c      
6     4 d      
7     5 z 

The data you provided contains only unique groups, so sampling one row per group returns the same number of rows as your data.
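
On the full dataset, one row per group will usually not add up to a fixed target such as the 2000 rows mentioned in the question. A minimal sketch of one way to handle that, assuming the goal is "at least one row per (id, quality) group, then fill the rest at random": the helper name `sample_covering_groups`, the `row_id` column, and the demo data are hypothetical and only serve the illustration.

```r
library(dplyr)

# Hypothetical helper: keep one random row per (id, quality) group,
# then top up with random rows until `total` rows are reached.
sample_covering_groups <- function(data, total) {
  # Tag every row so already-selected rows can be excluded later.
  data <- data %>% ungroup() %>% mutate(row_id = row_number())

  # One random row from every (id, quality) combination.
  one_per_group <- data %>%
    group_by(id, quality) %>%
    slice_sample(n = 1) %>%
    ungroup()

  # Rows still needed to reach the target (never negative).
  n_extra <- max(total - nrow(one_per_group), 0)

  # Draw the remainder from the rows that were not selected yet.
  extra <- data %>%
    anti_join(one_per_group, by = "row_id") %>%
    slice_sample(n = n_extra)

  bind_rows(one_per_group, extra) %>% select(-row_id)
}

# Small illustration: 9 rows, 7 distinct (id, quality) groups.
set.seed(1)  # only to make the illustration repeatable
demo <- data.frame(id = c(1, 2, 3, 4, 5, 1, 2, 2, 2),
                   quality = c("a", "b", "c", "d", "z", "g", "t", "t", "t"))
picked <- sample_covering_groups(demo, total = 8)
```

If the data has more groups than `total`, this sketch returns one row per group and therefore more than `total` rows; covering every group and staying under the target cannot both hold in that case.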


Edit: sample_n is superseded by slice_sample; I wasn't aware of this. You can easily adapt the script as follows:

df %>% 
        group_by(id,quality) %>% 
        slice_sample(
                n = 1
        )

You can also sample a proportion of each group by setting prop instead of n:

df %>% 
        group_by(id,quality) %>% 
        slice_sample(
                prop = 0.25
        )
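
One caveat when using prop with small groups: as far as I understand the dplyr documentation, the per-group row count computed from prop is rounded toward zero, so a group with a single row contributes nothing at prop = 0.25. The sketch below compares both calls on the question's sample data; the names `by_prop` and `by_n` are only for illustration.

```r
library(dplyr)

df <- data.frame(id = c(1, 2, 3, 4, 5, 1, 2),
                 quality = c("a", "b", "c", "d", "z", "g", "t"))

# Every (id, quality) group here has exactly one row, so 0.25 * 1
# rounds down to 0 rows per group.
by_prop <- df %>%
  group_by(id, quality) %>%
  slice_sample(prop = 0.25)

# n = 1 keeps one row from every group regardless of group size.
by_n <- df %>%
  group_by(id, quality) %>%
  slice_sample(n = 1)

nrow(by_prop)  # number of rows kept with prop
nrow(by_n)     # number of rows kept with n
```

If every group must appear in the result, n = 1 (or larger) is the safer choice; prop only guarantees coverage when each group is large enough for the proportion to reach at least one row.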

Upvotes: 1

DataM

Reputation: 351

I think that's what you're looking for.

library(dplyr)

my_df <- data.frame(id = c(1, 2, 3, 4, 5, 1, 2, 2, 2),
                    quality = c("a", "b", "c", "d", "z", "g", "t", "t", "t"))

# Unique = one integer per (id, quality) group; Test = original row number
my_df <- my_df %>% group_by(id, quality) %>% mutate(Unique = cur_group_id())
my_df$Test <- seq_len(nrow(my_df))

# Each call draws one random row per group, so repeated draws can differ
my_a <- my_df %>% group_by(Unique) %>% sample_n(1)
my_b <- my_df %>% group_by(Unique) %>% sample_n(1)
my_c <- my_df %>% group_by(Unique) %>% sample_n(1)
my_d <- my_df %>% group_by(Unique) %>% sample_n(1)
my_e <- my_df %>% group_by(Unique) %>% sample_n(1)

You don't need this many data frames; they are just examples to show that one row is extracted at random for each unique group. The difference shows up in the "Test" column, especially for id = 2 and quality = "t", the only group with more than one row in this sample data.

Upvotes: 2
