Reputation: 421
I want to use ddply (package plyr) with cor to compute Pearson correlations split by a factor ("Plot"). I can successfully do that when columns are passed to cor as column names, but not when passed by column number.
The data frame:
head(chlor2013.df)
Plot X645 X665 Chlorophyll
1 1 0.019 0.054 0.3647
2 1 0.061 0.170 1.1588
3 1 0.021 0.054 0.3827
4 2 0.033 0.092 0.6270
5 2 0.055 0.148 1.0259
6 2 0.018 0.045 0.3234
Using ddply and cor, with column names of the data frame:
ddply(chlor2013.df, .(Plot), summarize, cor.v2.v3 = cor(X645,X665, use="complete.obs"))
Plot cor.v2.v3
1 1 0.9610698
2 2 0.9261662
3 3 0.9191197
4 4 0.9104561
5 5 0.9541877
6 6 0.8750801
7 7 0.9949413
Notice that each row shows a unique correlation value. The above is what I want.
Using ddply and cor, with column numbers of the data frame:
ddply(chlor2013.df, .(Plot), summarize, cor.v2.v3 = cor(chlor2013.df[2:3],
use="complete.obs"))
Plot cor.v2.v3.1 cor.v2.v3.2
1 1 1.0000000 0.9698445
2 1 0.9698445 1.0000000
3 2 1.0000000 0.9698445
4 2 0.9698445 1.0000000
5 3 1.0000000 0.9698445
6 3 0.9698445 1.0000000
7 4 1.0000000 0.9698445
8 4 0.9698445 1.0000000
9 5 1.0000000 0.9698445
10 5 0.9698445 1.0000000
11 6 1.0000000 0.9698445
12 6 0.9698445 1.0000000
13 7 1.0000000 0.9698445
Now all the r values are identical, and they equal the correlation of the two columns computed over the whole data frame, not split by the factor. So the column-number syntax works differently from the column-name syntax. What am I missing?
Ultimately, I want to compute the correlation matrix for all three variables (X645, X665, and Chlorophyll), split by Plot.
Thanks
Upvotes: 2
Views: 1145
Reputation: 67778
You need to refer to each subset of 'chlor2013.df' by using an anonymous function. In your original attempt, the same full data set, chlor2013.df[2:3], was used in the calculation for every level of 'Plot'. Also note that cor(df[2:3]) is not the same as cor(df[2], df[3]) (compare with your first call: cor(X645, X665)). The former returns a 2 x 2 correlation matrix, the latter a single correlation (as a 1 x 1 matrix).
ddply(df, .(Plot), function(x) cor.v2.v3 = cor(x[2], x[3], use = "complete.obs"))
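To see why the subset matters, here is a minimal base-R sketch of the same idea, using split and sapply in place of ddply so that it runs without plyr. The data frame toy and its values are made up for illustration and are not the measurements from the question.

```r
# Toy data (hypothetical values, not the poster's measurements)
toy <- data.frame(
  Plot = rep(1:2, each = 4),
  X645 = c(0.02, 0.06, 0.03, 0.05, 0.01, 0.04, 0.07, 0.02),
  X665 = c(0.05, 0.17, 0.08, 0.15, 0.09, 0.03, 0.10, 0.12)
)

# Problem case: the function ignores its argument x and indexes the
# full data frame, so every group gets the same overall correlation.
same_for_all <- sapply(
  split(toy, toy$Plot),
  function(x) cor(toy[[2]], toy[[3]], use = "complete.obs")
)

# Fix: index into the per-group subset x that the function receives.
per_group <- sapply(
  split(toy, toy$Plot),
  function(x) cor(x[[2]], x[[3]], use = "complete.obs")
)

print(same_for_all)
print(per_group)
```

The first result repeats one value for every Plot, which is the behaviour described in the question. The second gives one correlation per Plot, because each call only sees the rows of its own group.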
Update following comment
In the example above, cor is fed two single-column inputs, 'X645' and 'X665'. You can also use a numeric matrix or data frame as input to create a "Correlation Matrix of Multivariate sample" (please see ?cor, e.g. cor(longley)).
# referring to variables by index
ddply(df, .(Plot), function(x) cor.v2.v3 = cor(x[2:4], use = "complete.obs"))
# referring to variables by name (better practice)
ddply(df, .(Plot), function(x) cor.v2.v3 = cor(x[ , c("X645", "X665", "Chlorophyll")], use = "complete.obs"))
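If you would rather keep each Plot's correlation matrix as a separate object instead of stacked rows, one possible alternative (not part of the ddply approach above) is base R's split with lapply, which returns a named list of matrices. This is only a sketch: toy, vars and cor_by_plot are hypothetical names, and the values are invented for illustration.

```r
# Toy data (hypothetical values, not the poster's measurements)
toy <- data.frame(
  Plot        = rep(1:2, each = 4),
  X645        = c(0.02, 0.06, 0.03, 0.05, 0.01, 0.04, 0.07, 0.02),
  X665        = c(0.05, 0.17, 0.08, 0.15, 0.09, 0.03, 0.10, 0.12),
  Chlorophyll = c(0.36, 1.16, 0.55, 1.02, 0.40, 0.31, 0.75, 0.66)
)

# Columns to correlate, referred to by name
vars <- c("X645", "X665", "Chlorophyll")

# One 3 x 3 correlation matrix per level of Plot, stored in a named list
cor_by_plot <- lapply(
  split(toy, toy$Plot),
  function(x) cor(x[, vars], use = "complete.obs")
)

# Matrix for Plot 1
print(cor_by_plot[["1"]])
```

Each list element is named after its Plot level, so cor_by_plot[["2"]] gives the matrix for Plot 2.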
Upvotes: 3