baha-kev

Reputation: 3059

plyr with nested groups?

Is there an elegant way to use ddply() to obtain output not only for the most granular groups defined, but also for the groups of those sub-groups?

In other words, cases where one of the classifiers is "any", "either", or "doesn't matter". With two grouping variables this can be handled by a separate call to ddply; however, with three or more classifiers that can each be set to "any", this gets messy because ddply has to be run again for every new combination of "any" plus the others.

Reproducible example:

require(plyr)

## create a data frame with three classification variables
## and two numeric variables:
df1 <- data.frame(classifier1 = LETTERS[sample(2, 200, replace = TRUE)],
                  classifier2 = letters[sample(3, 200, replace = TRUE)],
                  classifier3 = rep(c("foo", "bar"), 100),
                  VAR1 = runif(200, 50, 250),
                  VAR2 = rnorm(200, 85, 20))

## apply an arbitrary function to subsets of df1; that is, all unique
## combinations of the three classifiers.
dlply(df1, .(classifier1,classifier2,classifier3),
      function(df) lm(VAR1 ~ VAR2, data=df))

$A.a.bar

Call:
lm(formula = VAR1 ~ VAR2, data = df)

Coefficients:
(Intercept)         VAR2  
   230.5555      -0.8591  


$A.a.foo

Call:
lm(formula = VAR1 ~ VAR2, data = df)

Coefficients:
(Intercept)         VAR2  
   128.3078       0.3631  

...

Now, what if I want the same output for additional groups in which one or more classifiers are left out? For example, to cover classifier1 = "any", I would include only classifier2 and classifier3 in the dlply call, like this:

dlply(df1, .(classifier2,classifier3), function(df) lm(VAR1 ~ VAR2, data=df))

If I then wanted output for classifier2 = "any" and classifier3 = "any", I would again drop variables from the dlply call and keep only classifier1:

dlply(df1, .(classifier1), function(df) lm(VAR1 ~ VAR2, data=df))

However, this gets unwieldy when I have many more than three classifiers and each one can be taken out (i.e. = "any"): with n classifiers there are 2^n - 1 non-empty combinations. Is there an elegant/fast way to obtain output for all the "groups of groups" of my data?

Upvotes: 3

Views: 566

Answers (1)

mnel

Reputation: 115382

One approach would be to create a list of the combinations and then use Map to create a list of the results of each dlply call.

You can use combn together with lapply and do.call('c', ...) to create a list of all the combinations of 1, 2, ..., n variables:

xx <- do.call('c',lapply(1:3, function(m) {
           combn(x=names(df1)[1:3],m, simplify = FALSE)}))
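To make the shape of that list concrete, here is a minimal base-R sketch of the same enumeration. The classifier names are hard-coded from the question's df1 so the snippet runs on its own, and the dot-separated labels attached to each subset are only an illustrative naming convention, not something plyr requires.

```r
# Classifier names taken from the question's df1; hard-coded so this runs standalone.
vars <- c("classifier1", "classifier2", "classifier3")

# All non-empty subsets of the classifiers, from size 1 up to size length(vars).
combos <- do.call("c", lapply(seq_along(vars), function(m) {
  combn(vars, m, simplify = FALSE)
}))

# Assumed labelling convention: join the variable names with "." so that
# each subset (and later each result) can be looked up by name.
names(combos) <- vapply(combos, paste, character(1), collapse = ".")

length(combos)  # 2^3 - 1 = 7 subsets
names(combos)
```

Every variable missing from a given subset is the one being treated as "any" for that grouping.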

You can then pass this list to Map (a wrapper for mapply(..., SIMPLIFY = FALSE)):

results <- Map(f = function(x) {
  dlply(df1, .variables = x, .fun = lm, formula = VAR1 ~ VAR2)
}, xx)

Alternatively, pass the function directly to combn through its FUN argument, which does the same thing:

results <- do.call('c', lapply(1:3, function(m) {
  combn(x = names(df1)[1:3], m = m, simplify = FALSE,
        FUN = function(vv) {
          dlply(df1, .variables = vv, .fun = lm, formula = VAR1 ~ VAR2)
        })
}))
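Since the question asks about ddply, the same idea can be sketched with a data-frame result, where every classifier left out of a grouping is filled in as "any". This is only an illustration under stated assumptions: the seed, the helper name fit_by, and the output column names intercept and slope are all made up here, and the model is the same arbitrary lm(VAR1 ~ VAR2) used in the question.

```r
library(plyr)

set.seed(42)  # assumption: fixed seed so the sketch is repeatable
df1 <- data.frame(
  classifier1 = LETTERS[sample(2, 200, replace = TRUE)],
  classifier2 = letters[sample(3, 200, replace = TRUE)],
  classifier3 = rep(c("foo", "bar"), 100),
  VAR1 = runif(200, 50, 250),
  VAR2 = rnorm(200, 85, 20),
  stringsAsFactors = FALSE
)

vars <- c("classifier1", "classifier2", "classifier3")
combos <- do.call("c", lapply(seq_along(vars), function(m) {
  combn(vars, m, simplify = FALSE)
}))

# Hypothetical helper: fit the model within each group defined by the
# variables in v, then mark every omitted classifier as "any".
fit_by <- function(v) {
  out <- ddply(df1, v, function(d) {
    cf <- coef(lm(VAR1 ~ VAR2, data = d))
    data.frame(intercept = cf[[1]], slope = cf[[2]])
  })
  for (nm in setdiff(vars, v)) out[[nm]] <- "any"
  out[c(vars, "intercept", "slope")]
}

# One data frame holding the rows for every grouping level.
coef_table <- ldply(combos, fit_by)
head(coef_table)
```

Rows where a classifier equals "any" come from groupings that ignore that variable, so a single table holds both the most granular groups and the "groups of groups".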

Upvotes: 4
