rk567

Reputation: 289

Select subset of unique random records in R

I want to select a subset of n random records from a dataframe but I want unique values based on a column. For instance, from the dataset

X1 X2
1  4
1  5
1  6
2  44
2  55
3  444
3  555
3  666
3  777

From this for n=3, I do not want something like:

X1  X2
 3 777
 3 555
 2  55

where two records share the same X1 value (X1 = 3). Instead, I want something like:

X1  X2
 1  5
 2  44
 3  555

How do I do this?

I tried the following:

df <- data.frame(matrix(c(1,1,1,2,2,3,3,3,3,4,4,4,5,5,5,5,5,4,5,6,44,55,444,555,666,777,4444,5555,6666,10,20,30,40,50),nrow=17,ncol=2))
colnames(df) <- c("X1","X2")  # df.colnames = ... would only create a new variable named df.colnames
df[sample(nrow(df),3),]

But it doesn't give me what I want, because the sampled rows can repeat the same X1 value. How do I tweak sample to get what I want? Or should I use a different function for subsetting?

Edit: Please note that my df is going to be about 50 million records, and I may want to sample 1 million of these (that is, 1 million rows with unique X1 values). Which method would be the most efficient?

Upvotes: 0

Views: 178

Answers (3)

A5C1D2H2I1M1N2O1R2T1

Reputation: 193527

You can use the stratified function from my "splitstackshape" package, like this:

library(splitstackshape)
set.seed(1) ## so you can reproduce this
stratified(df, "X1", 1)
#    X1  X2
# 1:  1   4
# 2:  2  44
# 3:  3 666

Alternatively, you can use sample_n from "dplyr":

library(dplyr)
set.seed(1) ## again, just to reproduce this
df %>% group_by(X1) %>% sample_n(1)
# Source: local data frame [3 x 2]
# Groups: X1
# 
#   X1  X2
# 1  1   4
# 2  2  44
# 3  3 666

Regarding your note, here are some quick timings on my system for 20M rows:

library(data.table) ## for as.data.table() below
set.seed(1)
df <- data.frame(X1 = sample(1000000, 20000000, TRUE), 
                 X2 = rnorm(20000000))
dim(df)
# [1] 20000000        2

system.time(df %>% group_by(X1) %>% sample_n(1))
#   user  system elapsed 
# 39.687   0.365  40.583 
system.time(as.data.table(df)[, list(X2=sample(X2,1)), by=X1])
#   user  system elapsed 
# 10.792   0.156  11.033 
system.time(stratified(df, "X1", 1))
#   user  system elapsed 
# 12.351   0.455  12.895 

(Of course, stratified will also give you other bells and whistles out of the box, like dynamic subsetting, taking samples proportional to the size of the groups, and so on :-) )
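Note that the approaches above return one row for every X1 value. If you only want n of them (as in the question, n = 3), one possible sketch with data.table is to draw one row index per group and then keep a random n of those rows. This is an illustration rather than a benchmarked solution: the names dt, one_per_group, keep, and result are made up here, and n is assumed to be no larger than the number of distinct X1 values.

```r
library(data.table)

set.seed(1)
# Toy data from the question; X1 is the grouping column.
dt <- data.table(
  X1 = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5),
  X2 = c(4, 5, 6, 44, 55, 444, 555, 666, 777,
         4444, 5555, 6666, 10, 20, 30, 40, 50)
)

n <- 3  # number of distinct X1 values wanted (assumed <= number of groups)

# Step 1: one random row number per group.
# sample(.N, 1) draws a position within the group; .I maps it to a row of dt.
one_per_group <- dt[, .I[sample(.N, 1)], by = X1]$V1

# Step 2: keep a uniform random subset of n of those rows.
keep <- one_per_group[sample.int(length(one_per_group), n)]
result <- dt[keep]
```

Because step 2 samples among groups rather than among rows, each X1 value is equally likely to be kept regardless of how many rows it has.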

Upvotes: 4

jazzurro

Reputation: 23574

Another way, using dplyr:

library(dplyr)
group_by(df, X1) %>%
  sample_n(1)

#  X1  X2
#1  1   5
#2  2  55
#3  3 777
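In dplyr 1.0.0 and later, sample_n() is superseded by slice_sample(). A minimal sketch of the same idea with the newer verb, assuming a recent dplyr and using the 9-row example data from the question (the name one_per_group is only for illustration):

```r
library(dplyr)

set.seed(1)
df <- data.frame(
  X1 = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  X2 = c(4, 5, 6, 44, 55, 444, 555, 666, 777)
)

# On a grouped data frame, slice_sample(n = 1) draws one row from each group.
one_per_group <- df %>%
  group_by(X1) %>%
  slice_sample(n = 1) %>%
  ungroup()
```

The result has exactly one row per X1 value; which X2 you get depends on the seed and the dplyr version.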

Upvotes: 3

akrun

Reputation: 887148

Try aggregate from base R:

 set.seed(1)
 aggregate(X2~X1, df, sample, 1)
 #   X1  X2
 #1  1   4
 #2  2  44
 #3  3 666

Or using data.table (setDT converts df to a data.table by reference):

 library(data.table)
 set.seed(1)
 setDT(df)[, list(X2=sample(X2,1)), by=X1]
 #  X1  X2
 #1:  1   4
 #2:  2  44
 #3:  3 666

Upvotes: 3
