Reputation: 957
I have a very large data frame (150,000,000 rows) with a format like this:
df = data.frame(pnr = rep(500+2*(1:15),each=3), x = runif(3*15))
pnr is the person id and x is some data. I would like to sample 10% of the persons. Is there a fast way to do this in dplyr?
The following is a solution, but it is slow because of the merge statement:
prns = as.data.frame(unique(df$pnr))
names(prns)[1] = "pnr"
prns$s = rbinom(nrow(prns), 1, 0.1)
df = merge(df, prns)
df2 = df[df$s == 1, ]
Upvotes: 1
Views: 1460
Reputation: 193507
I would actually suggest the "data.table" package over "dplyr" for this. Here's an example with some big-ish sample data (not much smaller than your own 15 million rows).
I'll also show some right and wrong ways to do things :-)
Here's the sample data.
library(data.table)
library(dplyr)
library(microbenchmark)
set.seed(1)
mydf <- DT <- data.frame(person = sample(10000, 1e7, TRUE),
                         value = runif(1e7))
We'll also create a "data.table" and set the key to "person". Creating the "data.table" takes no significant time, but setting the key can.
system.time(setDT(DT))
# user system elapsed
# 0.001 0.000 0.001
## Setting the key takes some time, but is worth it
system.time(setkey(DT, person))
# user system elapsed
# 0.620 0.025 0.646
I can't think of a more efficient way to select your "person" values than the following, so I've removed this step from the benchmarks--it is common to all approaches.
## Common to all tests...
A <- unique(mydf$person)
B <- sample(A, ceiling(.1 * length(A)), FALSE)
For convenience, the different tests are presented as functions...
## Base R #1
fun1a <- function() {
  mydf[mydf$person %in% B, ]
}
## Base R #2--sometimes using `which` makes things quicker
fun1b <- function() {
  mydf[which(mydf$person %in% B), ]
}
## `filter` from "dplyr"
fun2 <- function() {
  filter(mydf, person %in% B)
}
## The "wrong" way to do this with "data.table"
fun3a <- function() {
  DT[which(person %in% B)]
}
## The "right" (I think) way to do this with "data.table"
fun3b <- function() {
  DT[J(B)]
}
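As an aside (my note, not part of the original answer): DT[J(B)] is fast because the table is keyed on "person", so the subset is a keyed join using a binary search on the sorted key rather than a full vector scan. In more recent versions of "data.table" the same subset can also be written as, for example:
DT[.(B)]                 ## `.()` is an alias for `J()` inside `[.data.table`
DT[.(B), on = "person"]  ## explicit join column; does not rely on a pre-set key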
Now, we can benchmark:
## The benchmarking
microbenchmark(fun1a(), fun1b(), fun2(), fun3a(), fun3b(), times = 20)
# Unit: milliseconds
# expr min lq median uq max neval
# fun1a() 382.37534 394.27968 396.76076 406.92431 494.32220 20
# fun1b() 401.91530 413.04710 416.38470 425.90150 503.83169 20
# fun2() 381.78909 394.16716 395.49341 399.01202 417.79044 20
# fun3a() 387.35363 397.02220 399.18113 406.23515 413.56128 20
# fun3b() 28.77801 28.91648 29.01535 29.37596 42.34043 20
Look at the performance we get from using "data.table" the right way! All the other approaches are impressively fast though.
A call to summary() shows the results to be the same. (The row order for the "data.table" solution would be different since it has been sorted by the key.)
summary(fun1a())
# person value
# Min. : 16 Min. :0.000002
# 1st Qu.:2424 1st Qu.:0.250988
# Median :5075 Median :0.500259
# Mean :4958 Mean :0.500349
# 3rd Qu.:7434 3rd Qu.:0.749601
# Max. :9973 Max. :1.000000
summary(fun2())
# person value
# Min. : 16 Min. :0.000002
# 1st Qu.:2424 1st Qu.:0.250988
# Median :5075 Median :0.500259
# Mean :4958 Mean :0.500349
# 3rd Qu.:7434 3rd Qu.:0.749601
# Max. :9973 Max. :1.000000
summary(fun3b())
# person value
# Min. : 16 Min. :0.000002
# 1st Qu.:2424 1st Qu.:0.250988
# Median :5075 Median :0.500259
# Mean :4958 Mean :0.500349
# 3rd Qu.:7434 3rd Qu.:0.749601
# Max. :9973 Max. :1.000000
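To connect this back to the pnr/x layout from the question, the keyed-join approach might look something like this on the toy data (a sketch with my own object names; not benchmarked here):
library(data.table)
set.seed(1)
DT <- data.table(pnr = rep(500 + 2 * (1:15), each = 3), x = runif(3 * 15))
setkey(DT, pnr)                                     ## sort once by person id
ids     <- unique(DT$pnr)                           ## all person ids
sampled <- sample(ids, ceiling(0.1 * length(ids)))  ## 10% of the persons
DT[J(sampled)]                                      ## keyed join: every row for the sampled persons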
Upvotes: 4
Reputation: 2414
If you don't necessarily want a thoroughly random sample, you could do filter(df, pnr %% 10 == 0), which would take every 10th person (you could get 10 different samples by changing to == 1, ...). You could make this random by re-allocating the IDs randomly; that is fairly trivial for your toy example using sample(15)[(df$pnr - 500)/2], but reversing the mapping of pnr onto a set that's suitable for sample might be less easy in the real-world case.
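For what it's worth, a minimal sketch of that random re-labelling idea on the toy data might look like this (new_id is my name for the re-allocated label, not something from the answer):
library(dplyr)
set.seed(42)
df <- data.frame(pnr = rep(500 + 2 * (1:15), each = 3), x = runif(3 * 15))
## (df$pnr - 500) / 2 maps the ids 502, 504, ..., 530 onto 1..15; indexing a
## random permutation of 1:15 then gives each person a random label.
df$new_id <- sample(15)[(df$pnr - 500) / 2]
## Keeping a single residue class is now a (roughly) 10% sample of persons.
df2 <- filter(df, new_id %% 10 == 0)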
Upvotes: 1
Reputation: 99321
In base R, to sample 10% of the rows, rounding up to the next row
> df[sample(nrow(df), ceiling(0.1*nrow(df)), FALSE), ]
## pnr x
## 16 512 0.9781232
## 21 514 0.5279925
## 33 522 0.8332834
## 14 510 0.7989481
## 4 504 0.7825318
or rounding down to the next row
> df[sample(nrow(df), floor(0.1*nrow(df)), FALSE), ]
## pnr x
## 43 530 0.449985180
## 35 524 0.996350657
## 2 502 0.499871966
## 25 518 0.005199058
or sample 10% of the pnr column, rounding up
> sample(df$pnr, ceiling(0.1*length(df$pnr)), FALSE)
## [1] 530 516 526 518 514
ADD:
If you're looking to sample 10% of the people (unique pnr IDs) and return those people and their respective data, I think you want
> S <- sample(unique(df$pnr), ceiling(0.1*length(unique(df$pnr))), FALSE)
> df[df$pnr %in% S, ]
## pnr x
## 1 502 0.7630667
## 2 502 0.4998720
## 3 502 0.4839460
## 22 516 0.8248153
## 23 516 0.5795991
## 24 516 0.1572472
PS: I would wait for a dplyr answer. It will likely be quicker on 15 million rows.
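A dplyr version of that person-level sampling might look something like the following (a sketch with my own names, not a benchmarked answer):
library(dplyr)
set.seed(1)
df <- data.frame(pnr = rep(500 + 2 * (1:15), each = 3), x = runif(3 * 15))
## Sample 10% of the unique person ids, then keep all rows for those persons.
sampled_pnr <- df %>% distinct(pnr) %>% sample_frac(0.1)
df2 <- df %>% semi_join(sampled_pnr, by = "pnr")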
Upvotes: 4