I've been using dtplyr to speed up some overly complex dplyr code, and so far it's been excellent, apart from one issue I can't seem to resolve.
The problem is pretty straightforward to solve in both dplyr and data.table, but I can't see a way of applying it to a dtplyr_step object from lazy_dt() without using collect() or converting it back to a data.frame.
I'm trying to group a data frame by one column and, within each group, sample rows n times, where n is based on the values in another column.
Here's a working example in dplyr:
library(dplyr)
df <- data.frame(id = c("a","a","a","b","b","b","c","c","c","d","d","d"),
                 count = sample(1:25, 12, replace = TRUE))
df %>% group_by(id) %>% sample_n(max(count), replace = TRUE)
and in data.table:
library(data.table)
dt <- data.table(id = c("a","a","a","b","b","b","c","c","c","d","d","d"),
                 count = sample(1:25, 12, replace = TRUE))
dt[,.SD[sample(.N, max(count,.N), replace=TRUE)],by = id]
However, both approaches fail when applied to an equivalent "lazy" data.table created with lazy_dt() from the dtplyr package:
library(dtplyr)
df2 <- lazy_dt(df)
df2 %>% group_by(id) %>% sample_n(max(count), replace = TRUE)
fails with Error in max(count) : invalid 'type' (closure) of argument
df2[,.SD[sample(.N, max(count,.N), replace=TRUE)],by = id]
fails with Error in max(count, .N) : invalid 'type' (closure) of argument
Presumably this is because the count column is no longer recognised as numeric.
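Looking at the message again, the word "closure" suggests that count may be getting resolved to a function rather than to the column at all. Here is a minimal sketch of that reading, assuming dplyr is attached so that the bare name count resolves to dplyr::count() (df_small, msg and col_max are just illustrative names, not part of my real code):

```r
library(dplyr)

# Outside a data context, the bare name `count` is not a column;
# with dplyr attached it resolves to the function dplyr::count().
print(typeof(count))   # functions written in R have type "closure"

# Calling max() on a function produces the same kind of error as above.
msg <- tryCatch(max(count), error = function(e) conditionMessage(e))
print(msg)

# Inside a data context the column shadows the function,
# so max(count) is an ordinary numeric maximum.
df_small <- data.frame(id = c("a", "a", "b"), count = c(3L, 7L, 5L))
col_max <- with(df_small, max(count))
print(col_max)
```

If that is what is happening, the expression max(count) would be getting evaluated before the lazy data.table's columns are in scope, though I haven't confirmed that this is how dtplyr handles it.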
Is there a way of doing this in dtplyr without converting back to a data.frame or data.table (other than recoding the original dplyr code entirely in data.table)?