Reputation: 289
I want to select n random records from a data frame, with each selected record having a distinct value in a given column. For instance, from the dataset
X1 X2
1 4
1 5
1 6
2 44
2 55
3 444
3 555
3 666
3 777
From this, for n = 3, I do not want something like:
X1 X2
3 777
3 555
2 55
where two records share the same X1 value (X1 = 3). Instead, I want something like:
X1 X2
1 5
2 44
3 555
How do I do this?
I tried the following:
df <- data.frame(matrix(c(1,1,1,2,2,3,3,3,3,4,4,4,5,5,5,5,5,4,5,6,44,55,444,555,666,777,4444,5555,6666,10,20,30,40,50),nrow=17,ncol=2))
colnames(df) <- c("X1","X2")
df[sample(nrow(df),3),]
But it doesn't give me what I want. How do I tweak sample to get what I want? Or should I use a different function for subsetting?
Edit: Please note that my df is going to have about 50 million records, and I may want to sample 1 million of them (that is, 1 million unique data points). Which method would be the most efficient?
Upvotes: 0
Views: 178
Reputation: 193527
You can use the stratified function from my "splitstackshape" package, like this:
library(splitstackshape)
set.seed(1) ## so you can reproduce this
stratified(df, "X1", 1)
# X1 X2
# 1: 1 4
# 2: 2 44
# 3: 3 666
Alternatively, you can use sample_n from "dplyr":
library(dplyr)
set.seed(1) ## again, just to reproduce this
df %>% group_by(X1) %>% sample_n(1)
# Source: local data frame [3 x 2]
# Groups: X1
#
# X1 X2
# 1 1 4
# 2 2 44
# 3 3 666
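In dplyr 1.0.0 and later, sample_n() is superseded by slice_sample(). A minimal sketch of the equivalent call, assuming a recent dplyr and the nine-row example data from the question (the name res is just for illustration):

```r
library(dplyr)

# Example data from the question
df <- data.frame(X1 = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
                 X2 = c(4, 5, 6, 44, 55, 444, 555, 666, 777))

set.seed(1)
# One random row per distinct X1 value
res <- df %>%
  group_by(X1) %>%
  slice_sample(n = 1) %>%
  ungroup()
res
```

The rows picked for a given seed may differ from the sample_n output shown here, so treat the result as illustrative.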
Regarding your note, here are some quick timings on my system for 20M rows:
set.seed(1)
df <- data.frame(X1 = sample(1000000, 20000000, TRUE),
X2 = rnorm(20000000))
dim(df)
# [1] 20000000 2
system.time(df %>% group_by(X1) %>% sample_n(1))
# user system elapsed
# 39.687 0.365 40.583
system.time(as.data.table(df)[, list(X2=sample(X2,1)), by=X1])
# user system elapsed
# 10.792 0.156 11.033
system.time(stratified(df, "X1", 1))
# user system elapsed
# 12.351 0.455 12.895
(Of course, stratified will also give you other bells and whistles out of the box, like dynamic subsetting, taking samples proportional to the size of the groups, and so on :-) )
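Note that sampling one row per group returns a row for every distinct X1. If there are more groups than the n rows you want (the 17-row data frame in the question has five groups), one possible approach is to pick n group keys first and then take one random row from each of them. A base-R sketch under that assumption; the names keys, chosen, sub and result are only for illustration:

```r
# The 17-row data from the question's attempt (five distinct X1 values)
df <- data.frame(
  X1 = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5),
  X2 = c(4, 5, 6, 44, 55, 444, 555, 666, 777,
         4444, 5555, 6666, 10, 20, 30, 40, 50)
)
n <- 3

set.seed(1)
# Step 1: choose n distinct group keys, each group equally likely
keys   <- unique(df$X1)
chosen <- keys[sample.int(length(keys), n)]

# Step 2: keep only rows from the chosen groups, then shuffle them
sub <- df[df$X1 %in% chosen, ]
sub <- sub[sample.int(nrow(sub)), ]

# Step 3: the first row seen for each key is a uniform pick within that group
result <- sub[!duplicated(sub$X1), ]
result
```

Subsetting before shuffling keeps the shuffle limited to the chosen groups, which should matter when the full data has tens of millions of rows; I have not timed this variant.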
Upvotes: 4
Reputation: 23574
This could be another way, using dplyr:
group_by(df, X1) %>%
sample_n(1)
# X1 X2
#1 1 5
#2 2 55
#3 3 777
Upvotes: 3
Reputation: 887148
Try aggregate from base R:
set.seed(1)
aggregate(X2~X1, df, sample, 1)
# X1 X2
#1 1 4
#2 2 44
#3 3 666
Or using data.table:
library(data.table)
set.seed(1)
setDT(df)[, list(X2=sample(X2,1)), by=X1]
# X1 X2
#1: 1 4
#2: 2 44
#3: 3 666
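One caveat with sample(X2, 1): when a group has a single row and its X2 is a number >= 1, sample() treats that value as an upper bound and draws from 1:X2 instead of returning the value itself. Sampling a row index avoids this and also keeps every column of the data frame. A base-R sketch, using made-up data in which group 2 has only one row (the names idx and picked are illustrative):

```r
# Made-up data: X1 = 2 has a single row
df <- data.frame(X1 = c(1, 1, 2, 3, 3),
                 X2 = c(4, 5, 44, 444, 555))

set.seed(1)
# Split row numbers by group and pick one index from each;
# sample.int(length(i), 1) is safe even when a group has one row
idx <- vapply(split(seq_len(nrow(df)), df$X1),
              function(i) i[sample.int(length(i), 1)],
              integer(1))
picked <- df[unname(idx), ]
picked
```

With data.table, the same idea can be written as df[df[, .I[sample.int(.N, 1)], by = X1]$V1] after setDT(df).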
Upvotes: 3