Reputation: 69
I am very new to R.
I am interested in calculating Pearson Correlations for my data. I have successfully figured out how to calculate the correlation of two continuous variables within my data set, x and y; however, I am hoping to "stratify" the correlations by a third, categorical variable: state. I would like to be able to say "the correlation co-efficient/p-value of x and y is [Result] in [State]."
I have tried the group_by method, located in the dplyr package, housed within the cor.test (shown below). I am in need of both the coefficients and the p-values, so I have been trying to use the cor.test method. I have also tried using a matrix method, but was unsuccessful that way as well.
Data<-read.csv("PATHWAYNAME")
library(dplyr)
CCor<-cor.test(Data$x, Data$y,
method=c("pearson"), group_by(State))
CCor
I am able to run each set of values for each state individually to get the coefficients and p-values; however, I am certain that there is a more efficient way to complete this task. My data is large enough that it will be beyond tedious to run them individually.
Thank you in advance for your help!
UPDATE: Using this as a sample data set that is extremely truncated, but similarly represents the variables in my own, I would like to know if the average income correlates with the number of visits in each state listed; that is, does the average income have a positive or negative correlation with the number of visits in the state of Alabama?
>State NumVis AvgIncome
>IN 45 60000
>AL 100 56000
>AK 45 80000
>ME 89 54000
>NC 120 100000
>SC 356 43000
>ND 100 25000
>SD 63 20000
>MN 54 46000
>ID 85 55000
When running this data using the code indicated below, my outcome is this:
CorrDat<-read.csv("File")
CorrDat %>%
group_by(State) %>%
do(tidy(cor.test(CorrDat$NumVis, CorrDat$Income, method="pearson")))
Would you be able to help clarify what it is that I am doing incorrectly with this code or if I need to use an alternative method to accomplish this task?
Upvotes: 4
Views: 2430
Reputation: 19716
There are several ways this can be accomplished in R. dplyr
or more generally tidyverse
are a popular group of tools that are able to achieve the desired result. The key difference in these tools is the pipe %>%
which provides a means to write code from left to right instead of from the inside out (or to make a bunch of intermediary objects in the environment). Although the pipe can be used with base R its popularity came with dplyr
.
Here are several examples on mtcars data set. The key functions are do
and map
which are quite versatile. I suggest running ?do
and ?map
.
library(tidyverse)
mtcars %>%
group_by(cyl) %>%
summarize(cor = cor(mpg, disp))
#output
# A tibble: 3 x 2
cyl correlation
<dbl> <dbl>
1 4 -0.8052361
2 6 0.1030827
3 8 -0.5197670
another way is:
mtcars %>%
group_by(cyl) %>%
do(cor = cor(.$mpg, .$disp)) %>%
unnest()
or for more variables:
mtcars %>%
group_by(cyl) %>%
do(cor = as.data.frame(cor(.[,-2]) )) %>%
unnest()
an example with cor.test:
library(broom)
mtcars %>%
group_by(cyl) %>%
do(tidy(cor.test(.$mpg, .$disp)))
#output
cyl estimate statistic p.value parameter conf.low conf.high method alternative
<dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <fctr> <fctr>
1 4 -0.8052361 -4.0740206 0.002782827 9 -0.9474526 -0.39724826 Pearson's product-moment correlation two.sided
2 6 0.1030827 0.2317344 0.825929685 5 -0.7046776 0.79446840 Pearson's product-moment correlation two.sided
3 8 -0.5197670 -2.1075838 0.056774876 12 -0.8232990 0.01492976 Pearson's product-moment correlation two.sided
and yet another way using purrr::map:
mtcars %>%
split(.$cyl) %>%
map(~cor.test(.x$mpg, .x$disp))
which gives a list, which can be manipulated with the same or another map function:
mtcars %>%
split(.$cyl) %>%
map(~cor.test(.x$mpg, .x$disp)) %>%
map_dbl("p.value")
#output:
4 6 8
0.002782827 0.825929685 0.056774876
to extract the coefficients:
mtcars %>%
split(.$cyl) %>%
map(~cor.test(.x$mpg, .x$disp)) %>%
map(~data.frame(cor = .x$estimate, p = .x$p.value)) #check also `map_dfr` and `map_dfc`
#output
$`4`
cor p
cor -0.8052361 0.002782827
$`6`
cor p
cor 0.1030827 0.8259297
$`8`
cor p
cor -0.519767 0.05677488
UPDATE: answer to the updated question:
The problem is how you specify the do
call. This is correct:
df %>%
group_by(State) %>%
do(tidy(cor.test(.$NumVis, .$AvgIncome, method="pearson")))
where the .
represents the data passed by the previous pipe. In the example posted this results in:
Error in cor.test.default(.$NumVis, .$AvgIncome, method = "pearson") :
not enough finite observations
which is reasonable considering only 1 observation per group is suplied
what you did is:
CorrDat<-read.csv("File")
CorrDat %>%
group_by(State) %>%
do(tidy(cor.test(CorrDat$NumVis, CorrDat$Income, method="pearson")))
passing the whole CorrDat set to the do
function so it performed the same operation as many times as there are groups.
The %>%
pipe assumes the data passed will be used as the first argument in the following function, if it should not the data can be refereed to as .
. You can perform operations like .$column
or .[,2]
etc.
Upvotes: 6
Reputation: 5893
With base r, you could use by
.
E.g., replicating one of the examples in missuse's post:
do.call(rbind,
by(mtcars, mtcars$cyl, FUN = function(x) cor.test(x$mpg, x$disp, data = x)))
statistic parameter p.value estimate null.value alternative method data.name conf.int
4 -4.074021 9 0.002782827 -0.8052361 0 "two.sided" "Pearson's product-moment correlation" "x$mpg and x$disp" Numeric,2
6 0.2317344 5 0.8259297 0.1030827 0 "two.sided" "Pearson's product-moment correlation" "x$mpg and x$disp" Numeric,2
8 -2.107584 12 0.05677488 -0.519767 0 "two.sided" "Pearson's product-moment correlation" "x$mpg and x$disp" Numeric,2
Upvotes: 0