Reputation: 141
I am trying to calculate the correlation between two numeric columns in a data frame for each level of a factor. Here is an example data frame:
concentration <- c(3, 8, 4, 7, 3, 1, 3, 3, 8, 6)
area <- c(0.5, 0.9, 0.3, 0.4, 0.5, 0.8, 0.9, 0.2, 0.7, 0.7)
area_type <- c("A", "B", "A", "B", "A", "B", "A", "B", "A", "B")
data_frame <- data.frame(concentration, area, area_type)
In this example, I want to calculate the correlation between concentration and area for each level of area_type. I want to use cor.test rather than cor because I want p-values and Kendall's tau values. I have tried to do this using ddply:
ddply(data_frame, "area_type", summarise,
corr=(cor.test(data_frame$area, data_frame$concentration,
alternative="two.sided", method="kendall") ) )
However, I am having a problem with the output: it is organized differently from the normal Kendall cor.test output, which states z value, p-value, alternative hypothesis, and tau estimate. Instead of that, I get the output below. I don't know what each row of the output indicates. In addition, the output values are the same for each level of area_type.
area_type corr
1 A 0.3766218
2 A NULL
3 A 0.7064547
4 A 0.1001252
5 A 0
6 A two.sided
7 A Kendall's rank correlation tau
8 A data_frame$area and data_frame$concentration
9 B 0.3766218
10 B NULL
11 B 0.7064547
12 B 0.1001252
13 B 0
14 B two.sided
15 B Kendall's rank correlation tau
16 B data_frame$area and data_frame$concentration
What am I doing wrong with ddply? Or are there other ways of doing this? Thanks.
Upvotes: 2
Views: 7700
Reputation: 2244
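You can add an additional column holding the names of the components of corr. Also, your syntax is slightly incorrect. Wrapping the grouping variable in .() specifies that the variable comes from the data frame you've passed in. More importantly, remove the data_frame$ prefix inside cor.test, or else every group's test will be computed on the entire data frame instead of on that group's rows: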
ddply(data_frame, .(area_type), summarise,
corr=(cor.test(area, concentration,
alternative="two.sided", method="kendall")), name=names(corr) )
Which gives:
area_type corr name
1 A -0.285133 statistic
2 A NULL parameter
3 A 0.7755423 p.value
4 A -0.1259882 estimate
5 A 0 null.value
6 A two.sided alternative
7 A Kendall's rank correlation tau method
8 A area and concentration data.name
9 B 6 statistic
10 B NULL parameter
11 B 0.8166667 p.value
12 B 0.2 estimate
13 B 0 null.value
14 B two.sided alternative
15 B Kendall's rank correlation tau method
16 B area and concentration data.name
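statistic is the test statistic and estimate is the tau estimate. For group A, which contains ties, the statistic is the z-value from the normal approximation. For group B, which has no ties, cor.test runs the exact test, so the statistic is T (here 6) rather than a z-value.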
EDIT: You can also do it like this to only pull what you want:
# Thin wrapper: two-sided Kendall test on two vectors
corfun <- function(x, y) {
  cor.test(x, y, alternative = "two.sided", method = "kendall")
}
# One row per area_type, keeping only the components of interest
ddply(data_frame, .(area_type), summarise,
      z       = corfun(area, concentration)$statistic,
      pval    = corfun(area, concentration)$p.value,
      tau.est = corfun(area, concentration)$estimate,
      alt     = corfun(area, concentration)$alternative
)
Which gives:
area_type z pval tau.est alt
1 A -0.285133 0.7755423 -0.1259882 two.sided
2 B 6.000000 0.8166667 0.2000000 two.sided
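If calling the test four times per group bothers you, here is a minimal base R sketch that runs cor.test once per group and needs no plyr at all. The helper name kendall_row and the output column names are my own choices for illustration, not part of any package. It should reproduce the values in the table above.

```r
# Example data from the question
data_frame <- data.frame(
  concentration = c(3, 8, 4, 7, 3, 1, 3, 3, 8, 6),
  area          = c(0.5, 0.9, 0.3, 0.4, 0.5, 0.8, 0.9, 0.2, 0.7, 0.7),
  area_type     = c("A", "B", "A", "B", "A", "B", "A", "B", "A", "B")
)

# Hypothetical helper: run the Kendall test once on one group's rows
# and keep only the components of interest as a one-row data frame
kendall_row <- function(d) {
  # With ties, cor.test cannot compute an exact p-value and warns about it;
  # suppressWarnings() hides that warning here for brevity
  res <- suppressWarnings(
    cor.test(d$area, d$concentration,
             alternative = "two.sided", method = "kendall")
  )
  data.frame(
    area_type = d$area_type[1],
    statistic = unname(res$statistic),  # z with ties, T for the exact test
    pval      = res$p.value,
    tau.est   = unname(res$estimate),
    alt       = res$alternative
  )
}

# Split by group, apply the helper to each piece, and stack the rows
pieces <- split(data_frame, data_frame$area_type)
result <- do.call(rbind, lapply(pieces, kendall_row))
result
```

The same helper can also be passed straight to ddply as the function argument, since it takes a data frame and returns a data frame.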
Upvotes: 8
Reputation: 544
Part of the reason this is not working is that cor.test returns an object that prints like this:
Pearson's product-moment correlation
data: data_frame$concentration and data_frame$area
t = 0.5047, df = 8, p-value = 0.6274
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.5104148 0.7250936
sample estimates:
cor
0.1756652
This information cannot be put into a data.frame (which is what ddply returns) without further complicating the code. If you can specify exactly which values you need, I can provide further assistance. I would look at just using cor instead:
corrTest <- ddply(.data = data_frame,
                  .variables = .(area_type),
                  # .fun receives each group's rows as a data frame
                  .fun = function(d) data.frame(
                    tau = cor(d$concentration, d$area, method = "kendall")))
I haven't tested this code, but this is the route I would take initially and work from there.
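To illustrate why the cor.test result does not drop into a data frame column as-is, here is a small sketch that inspects the returned object on the question's data. It uses the default Pearson method, as in the printout above. The column names in the final data.frame call are just labels I picked.

```r
# Example vectors from the question
concentration <- c(3, 8, 4, 7, 3, 1, 3, 3, 8, 6)
area <- c(0.5, 0.9, 0.3, 0.4, 0.5, 0.8, 0.9, 0.2, 0.7, 0.7)

# Pearson is the default method
res <- cor.test(concentration, area)

class(res)    # the object's class is "htest"
is.list(res)  # underneath, it is a list of components
names(res)    # component names, e.g. "statistic", "p.value", "estimate"

# Individual components are plain values, so they fit in a data frame
# once extracted one by one
summary_row <- data.frame(
  t    = unname(res$statistic),
  df   = unname(res$parameter),
  pval = res$p.value,
  cor  = unname(res$estimate)
)
summary_row
```

Once you know which components you need, pulling them out like this inside the function you hand to ddply gives one tidy row per group.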
Upvotes: 0