Reputation: 4873
I have a data frame of almost 50,000 rows spread across 15 different IDs (every ID has thousands of observations). The data frame looks like:
ID Year Temp ph
1 P1 1996 11.3 6.80
2 P1 1996 9.7 6.90
3 P1 1997 9.8 7.10
...
2000 P2 1997 10.5 6.90
2001 P2 1997 9.9 7.00
2002 P2 1997 10.0 6.93
I want to take 500 random rows for every ID (so 500 for P1, 500 for P2, ...) and create a new data frame. I tried:
new_df <- df[df$ID %in% sample(unique(df$ID), 500), ]
But that samples IDs at random rather than rows within each ID; I need 500 random rows for every ID.
Upvotes: 52
Views: 66173
Reputation: 6416
In case you have big datasets, a data.table solution could go like this:
library(data.table)
# Generate 26 mil rows random data
set.seed(2023-08-11) # anchor the random number generator (RNG) state for reproducibility
dt <- data.table(c1 = sample(length(LETTERS) * 10^6),
                 c2 = sample(LETTERS, replace = TRUE))
# For each letter, sample 500 rows
set.seed(2023-08-11) # anchor the RNG again, as we use `sample` again
dt_sample <- dt[, .SD[sample(x = .N, size = 500)], by = c2]
# We indeed sampled 500 rows for each letter
dt_sample[, .N, by = c2][order(c2)]
#> c2 N
#> 1: A 500
#> 2: D 500
#> 3: G 500
#> 4: I 500
#> 5: M 500
#> 6: N 500
#> 7: O 500
#> 8: P 500
#> 9: Q 500
#> 10: R 500
#> 11: S 500
#> 12: T 500
#> 13: U 500
#> 14: V 500
#> 15: W 500
#> 16: Y 500
#> 17: Z 500
Created on 2019-04-23 by the reprex package (v0.2.1)
In case your data is unbalanced, in the sense that some groups have fewer rows than your desired sample size, you need a defensive cap on the sample size, such as min(500, .N); see sample random rows within each group in a data.table. So:
dt[, .SD[sample(x = .N, size = min(500, .N))], by = c2]
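A minimal sketch of that cap on a small, made-up unbalanced table (the grp/val names are illustrative, not from the question):
library(data.table)
set.seed(1)
# Group "B" deliberately has fewer than 500 rows
small <- data.table(grp = rep(c("A", "B"), times = c(2000, 100)),
                    val = rnorm(2100))
# min(500, .N) keeps all 100 rows of "B" instead of erroring
small[, .SD[sample(x = .N, size = min(500, .N))], by = grp][, .N, by = grp]
# grp "A" contributes 500 rows, grp "B" all 100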
Upvotes: 14
Reputation: 9532
This is available as the slice_sample function in dplyr:
library(dplyr)
new_df <- df %>% group_by(ID) %>% slice_sample(n=500)
In older versions of dplyr, the function was called sample_n, which has since been superseded.
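If you are on dplyr >= 1.1.0, the same per-group sample can be written without group_by() using slice_sample()'s by argument:
library(dplyr)
new_df <- df %>% slice_sample(n = 500, by = ID)
Note that when a group has fewer than 500 rows, slice_sample() silently truncates the sample to the group size rather than erroring.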
Upvotes: 94
Reputation: 11
Here's an elegant solution. You can randomly draw IDs from a panel data set (balanced or unbalanced) in three simple steps:
Step 1: Store unique IDs from your original data set in a vector (my data set is called "main" and the identifier is called "id"):
ids <- unique(main$id)
Step 2: Randomly draw IDs from the vector from step 1. In the example below, I randomly draw 50 IDs from the vector "ids" and store them in the new vector "draw":
library(magrittr) # the %>% pipe comes from magrittr (also re-exported by dplyr)
draw <- ids %>% sample(50)
Step 3: Subset rows in your original data set based on matches with the IDs drawn in step 2.
rsample <- main[main$id %in% draw, ]
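Put together, a self-contained run of the three steps on simulated data (the main/id names follow the answer, but the data here is made up):
library(magrittr) # provides the %>% pipe
set.seed(42)
main <- data.frame(id = rep(1:100, each = 20), x = rnorm(2000))
ids <- unique(main$id)               # step 1
draw <- ids %>% sample(50)           # step 2
rsample <- main[main$id %in% draw, ] # step 3
length(unique(rsample$id)) # 50 drawn IDs, each keeping all of its rows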
Upvotes: 1
Reputation: 1474
library(data.table) #1
df <- data.table(df) #2
df[, group_num := sample(2, .N, replace = TRUE, prob = c(500, .N - 500)/.N), by = "ID"] #3
df_sample = df[group_num == 1,] #4
Note that line #3 assigns each row to group 1 with probability 500/.N, so it yields approximately 500 rows per ID. To get exactly 500 rows per ID, change lines #3 and #4 to:
df[, random_num := sample(.N, .N), by = "ID"]
df_sample = df[random_num <= 500,]
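A quick sanity check on the exact variant (assuming the ID column from the question):
df_sample[, .N, by = "ID"] # every ID should show N = 500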
Upvotes: 0
Reputation: 400
Although this is not a very elegant solution, it works.
library(data.table)
df <- data.table(df)
f <- list()
for (i in unique(df$ID)) {
    f[[i]] <- df[ID == i][sample(.N, 500)]
}
dfnew <- rbindlist(f)
Upvotes: 0
Reputation: 193507
Here is one approach in base R.
First, the prerequisite sample data to work with:
set.seed(1)
mydf <- data.frame(ID = rep(1:3, each = 5), matrix(rnorm(45), ncol = 3))
mydf
# ID X1 X2 X3
# 1 1 -0.6264538 -0.04493361 1.35867955
# 2 1 0.1836433 -0.01619026 -0.10278773
# 3 1 -0.8356286 0.94383621 0.38767161
# 4 1 1.5952808 0.82122120 -0.05380504
# 5 1 0.3295078 0.59390132 -1.37705956
# 6 2 -0.8204684 0.91897737 -0.41499456
# 7 2 0.4874291 0.78213630 -0.39428995
# 8 2 0.7383247 0.07456498 -0.05931340
# 9 2 0.5757814 -1.98935170 1.10002537
# 10 2 -0.3053884 0.61982575 0.76317575
# 11 3 1.5117812 -0.05612874 -0.16452360
# 12 3 0.3898432 -0.15579551 -0.25336168
# 13 3 -0.6212406 -1.47075238 0.69696338
# 14 3 -2.2146999 -0.47815006 0.55666320
# 15 3 1.1249309 0.41794156 -0.68875569
Second, the sampling:
do.call(rbind,
lapply(split(mydf, mydf$ID),
function(x) x[sample(nrow(x), 3), ]))
# ID X1 X2 X3
# 1.2 1 0.1836433 -0.01619026 -0.1027877
# 1.1 1 -0.6264538 -0.04493361 1.3586796
# 1.5 1 0.3295078 0.59390132 -1.3770596
# 2.10 2 -0.3053884 0.61982575 0.7631757
# 2.9 2 0.5757814 -1.98935170 1.1000254
# 2.8 2 0.7383247 0.07456498 -0.0593134
# 3.13 3 -0.6212406 -1.47075238 0.6969634
# 3.12 3 0.3898432 -0.15579551 -0.2533617
# 3.15 3 1.1249309 0.41794156 -0.6887557
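To match the question's sample size of 500 while guarding against groups with fewer rows, the same pattern works with a capped size (my addition, not part of the original answer; with this toy data every group has only 5 rows, so each group simply comes back whole, shuffled):
do.call(rbind,
        lapply(split(mydf, mydf$ID),
               function(x) x[sample(nrow(x), min(500, nrow(x))), ]))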
There is also strata from the sampling package, which is convenient when you want to sample different sizes from each group:
# install.packages("sampling")
library(sampling)
set.seed(1)
x <- strata(mydf, "ID", size = c(2, 3, 2), method = "srswor")
getdata(mydf, x)
# X1 X2 X3 ID ID_unit Prob Stratum
# 2 0.1836433 -0.01619026 -0.1027877 1 2 0.4 1
# 5 0.3295078 0.59390132 -1.3770596 1 5 0.4 1
# 6 -0.8204684 0.91897737 -0.4149946 2 6 0.6 2
# 8 0.7383247 0.07456498 -0.0593134 2 8 0.6 2
# 9 0.5757814 -1.98935170 1.1000254 2 9 0.6 2
# 14 -2.2146999 -0.47815006 0.5566632 3 14 0.4 3
# 15 1.1249309 0.41794156 -0.6887557 3 15 0.4 3
Upvotes: 14
Reputation: 15458
mydata1 is your original data (not tested):
mydata2 <- split(mydata1, mydata1$ID)
names(mydata2) <- paste0("mydata2", seq_along(mydata2))
mysample <- Map(function(x) x[sample(1:nrow(x), size = 500, replace = FALSE), ], mydata2)
library(plyr) # for rbinding mysample
ldply(mysample)
Upvotes: 0
Reputation: 109844
An approach for when one of the IDs has fewer rows than the requested sample size. Here I used the mtcars set with n = 8 in place of 500:
n <- 8
df <- mtcars
df$ID <- df$cyl
FUN <- function(x, n) {
    # Keep every row index when the group has at most n rows;
    # otherwise keep a random n of them (in their original order)
    if (length(x) <= n) return(x)
    x[x %in% sample(x, n)]
}
df[unlist(lapply(split(1:nrow(df), df$ID), FUN, n = n)), ]
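Checking the per-group counts: mtcars has 11, 7, and 14 rows for 4, 6, and 8 cylinders, so with n = 8 the 6-cylinder group contributes all 7 of its rows:
table(df[unlist(lapply(split(1:nrow(df), df$ID), FUN, n = n)), ]$ID)
#  4  6  8
#  8  7  8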
Upvotes: 2
Reputation: 173517
Try this:
library(plyr)
ddply(df,.(ID),function(x) x[sample(nrow(x),500),])
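Note that sample(nrow(x), 500) errors for any ID with fewer than 500 rows; capping the size avoids that (a small addition to the answer):
ddply(df, .(ID), function(x) x[sample(nrow(x), min(500, nrow(x))), ])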
Upvotes: 20