slhck
slhck

Reputation: 38652

How do I pass variables to a custom function in ddply?

Consider the following data:

d = data.frame(
    experiment = as.factor(c("foo", "foo", "foo", "bar", "bar")),
    si = runif(5),
    ti = runif(5)
)

I would like to perform a correlation test for si and ti, for each experiment factor level. So I thought I'd run:

ddply(d, .(experiment), cor.test)

But how do I pass the values of si and ti to the cor.test call? I tried this:

> ddply(d, .(experiment), cor.test, x = si, y = ti)
Error in .fun(piece, ...) : object 'si' not found
> ddply(d, .(experiment), cor.test, si, ti)
Error in match.arg(alternative) : 
  'arg' must be NULL or a character vector

Is there anything obvious I'm missing? The plyr documentation does not include any example for me. Most commands I see only involve summarize as the function call, but doing the usual things I was used to doing from summarize don't work, as can be seen above.

Upvotes: 3

Views: 7046

Answers (2)

BrodieG
BrodieG

Reputation: 52637

ddply splits your data frame by the variables you select (experiment here) and then passes the function the resulting subsets of the data frame. In your case your function cor.test doesn't accept a data frame as an input, so you need a translation layer:

d <- data.frame(
  experiment = as.factor(c("foo", "foo", "foo", "bar", "bar", "bar")),
  si = runif(6),
  ti = runif(6)
)
ddply(d, .(experiment), function(d.sub) cor.test(d.sub$si, d.sub$ti)$statistic)
#   experiment         t
# 1        bar 0.1517205
# 2        foo 0.3387682

Also, your output has to be something like a vector or a data frame, which is why I just chose $statistic above, but you could have added multiple variables if you wanted.

Side note, I had to add a value to the input data frame as it cor.test won't run on 2 values (was the case for "bar"). If you want more comprehensive stats, you can try:

ddply(d, .(experiment), function(d.sub) {
  as.data.frame(cor.test(d.sub$si, d.sub$ti)[c("statistic", "parameter", "p.value", "estimate")])
} )
#   experiment statistic parameter   p.value  estimate
# 1        bar 0.1517205         1 0.9041428 0.1500039
# 2        foo 0.3387682         1 0.7920584 0.3208567 

Note that since we're now returning something more complex than just a vector, we need to coerce it to a data.frame. If you want to include more complex values (e.g. the confidence interval, which is a two value result), you would have to simplify them first.

Upvotes: 7

Roland
Roland

Reputation: 132706

You can use summarize for this if you don't mind running cor.test several times for each experiment (i.e., performance isn't an issue).

#note that you need at least 3 value pairs for cor.test
set.seed(42)
d = data.frame(
  experiment = as.factor(c("foo", "foo", "foo", "bar", "bar", "bar")),
  si = runif(6),
  ti = runif(6)
)

library(plyr)
ddply(d, .(experiment), summarize,
      r=cor.test(si, ti)$estimate,
      p=cor.test(si, ti)$p.value
      )

#  experiment           r         p
#1        bar  0.07401492 0.9528375
#2        foo -0.41842834 0.7251622

Upvotes: 1

Related Questions