Reputation: 2176
I want to change a bit of my code to use dplyr instead of plyr's ddply because I think it will be faster on my large (>1e6 rows) data set. Here is an example data set:
ID <- rep(1:3, each=6)
Row <- rep(1, each=18)
Col <- rep(rep(1:2, each=3), times=3)
Meas <- rnorm(18,3,1)
len <- rep(1:3, times=6)
df <- data.frame(ID, Row, Col, Meas, len)
The code I normally use is this:
res <- ddply(df, c("ID", "Row", "Col"), function(x) coefficients(lm(Meas~len,x)))
It fits an lm of Meas against len for each subset of df defined by ID, Row and Col, and extracts the coefficients. On my large data set it takes 30 seconds (not the end of the world, I know). When I try dplyr with this:
res2 <- df %>% group_by("ID", "Row", "Col") %>% (function(x) coefficients(lm(Meas~len,x))) %>%
as.data.frame()
I only get one intercept and gradient. I've read this question ("extracting p values from multiple linear regression (lm) inside of a ddply function using spatial data"), which led me to this attempt:
res3 <- df %>% group_by("ID", "Row", "Col") %>%
do({model=lm(Meas~len, data=.)
data.frame(tidy(model),
glance(model))})
But again no luck. I'm sure I'm missing something simple.
Update:
Out of interest for anyone running a similar thing on large data sets:
system.time(
lres <- ddply(I, c("ERF", "Wafer", "Row", "Col"), function(x) coefficients(lm(Rds.on.fwd~Length,x)))
)
user system elapsed
25.80 0.06 26.02
system.time(
lres2 <- I %>% group_by(ERF, Wafer, Row, Col) %>% do(
as.data.frame.list(coef(lm(Rds.on.fwd~Length, data=.))))
)
user system elapsed
43.12 0.25 44.02
system.time(
lres3 <- setDT(I)[, as.list(coef(lm(Rds.on.fwd~Length))), .(ERF,Wafer, Row, Col)]
)
user system elapsed
19.77 0.05 19.91
So the data.table option is actually the fastest, @akrun. Thank you again.
Upvotes: 2
Views: 95
Reputation: 887991
We modify the OP's last piece of code to get the expected output. We group by the variables 'ID', 'Row' and 'Col' (as bare names, not quoted strings), do the lm using the variables 'Meas' and 'len', extract the coefficients with coef, and convert them to a list and then to a data.frame (as.data.frame.list) to create two new columns holding the intercept and the slope.
df %>%
group_by(ID, Row, Col) %>%
do(as.data.frame.list(coef(lm(Meas~len, data=.))))
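To see why the quoted version collapses to a single fit, here is a minimal sketch on the question's example data. The `set.seed(1)` call and the names `n_quoted`, `n_bare` and `res` are my own additions for illustration, and `as.data.frame(as.list(...))` is just an alternative spelling of the `as.data.frame.list(...)` call above. As far as I know, `group_by("ID", ...)` treats each string as a constant expression rather than a column name, so every row ends up in the same group.

```r
library(dplyr)

set.seed(1)  # assumed seed, only so the sketch is reproducible
df <- data.frame(
  ID   = rep(1:3, each = 6),
  Row  = rep(1, each = 18),
  Col  = rep(rep(1:2, each = 3), times = 3),
  Meas = rnorm(18, 3, 1),
  len  = rep(1:3, times = 6)
)

# Quoted strings are evaluated as constants, so all rows fall into one group
n_quoted <- df %>% group_by("ID", "Row", "Col") %>% n_groups()

# Bare names refer to the actual columns: 3 IDs x 1 Row x 2 Cols
n_bare <- df %>% group_by(ID, Row, Col) %>% n_groups()

# One lm per group; each fit contributes one row of coefficients
res <- df %>%
  group_by(ID, Row, Col) %>%
  do(as.data.frame(as.list(coef(lm(Meas ~ len, data = .))))) %>%
  as.data.frame()
```

Comparing `n_quoted` with `n_bare` shows the difference in grouping directly, and `res` should have one row per ID/Row/Col combination.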
Or using data.table, we convert the 'data.frame' to a 'data.table', group by 'ID', 'Row' and 'Col', do the lm, extract the coefficients, and convert them to a list so that we get two new columns.
library(data.table)
setDT(df)[, as.list(coef(lm(Meas~len))), .(ID, Row, Col)]
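A self-contained variant of the same idea, as a rough sketch: it uses `as.data.table()` to work on a copy instead of converting `df` in place with `setDT()`, and renames the coefficient columns afterwards. The seed and the names `dt`, `res_dt`, `intercept` and `slope` are assumptions made for illustration only.

```r
library(data.table)

set.seed(1)  # assumed seed, only so the sketch is reproducible
df <- data.frame(
  ID   = rep(1:3, each = 6),
  Row  = rep(1, each = 18),
  Col  = rep(rep(1:2, each = 3), times = 3),
  Meas = rnorm(18, 3, 1),
  len  = rep(1:3, times = 6)
)

# as.data.table() returns a copy, leaving the original data.frame untouched
dt <- as.data.table(df)

# One lm per (ID, Row, Col) group; as.list() turns the two coefficients
# into two columns of the result
res_dt <- dt[, as.list(coef(lm(Meas ~ len))), by = .(ID, Row, Col)]

# Rename the two coefficient columns (positions 4 and 5) to friendlier names
setnames(res_dt, 4:5, c("intercept", "slope"))
```

Whether to use `setDT()` or `as.data.table()` mostly comes down to whether you want to avoid copying a large data set; on >1e6 rows the in-place `setDT()` from the answer is probably the cheaper choice.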
Upvotes: 3