user9165024
user9165024

Reputation: 69

Pearson Correlation "Stratified" by Categorical Variable

I am very new to R.

I am interested in calculating Pearson Correlations for my data. I have successfully figured out how to calculate the correlation of two continuous variables within my data set, x and y; however, I am hoping to "stratify" the correlations by a third, categorical variable: state. I would like to be able to say "the correlation co-efficient/p-value of x and y is [Result] in [State]."

I have tried the group_by method, located in the dplyr package, housed within the cor.test (shown below). I am in need of both the coefficients and the p-values, so I have been trying to use the cor.test method. I have also tried using a matrix method, but was unsuccessful that way as well.

Data<-read.csv("PATHWAYNAME")
  library(dplyr)
  CCor<-cor.test(Data$x, Data$y,
          method=c("pearson"), group_by(State))
  CCor

I am able to run each set of values for each state individually to get the coefficients and p-values; however, I am certain that there is a more efficient way to complete this task. My data is large enough that it will be beyond tedious to run them individually.

Thank you in advance for your help!

UPDATE: Using this as a sample data set that is extremely truncated, but similarly represents the variables in my own, I would like to know if the average income correlates with the number of visits in each state listed; that is, does the average income have a positive or negative correlation with the number of visits in the state of Alabama?

>State  NumVis  AvgIncome
>IN       45        60000
>AL       100       56000
>AK       45        80000
>ME       89        54000
>NC       120       100000
>SC       356       43000
>ND       100       25000
>SD       63        20000
>MN       54        46000
>ID       85        55000

When running this data using the code indicated below, my outcome is this:

 CorrDat<-read.csv("File")
     CorrDat %>%
       group_by(State) %>%
        do(tidy(cor.test(CorrDat$NumVis, CorrDat$Income, method="pearson")))

Results

Would you be able to help clarify what it is that I am doing incorrectly with this code or if I need to use an alternative method to accomplish this task?

Upvotes: 4

Views: 2430

Answers (2)

missuse
missuse

Reputation: 19716

There are several ways this can be accomplished in R. dplyr or more generally tidyverse are a popular group of tools that are able to achieve the desired result. The key difference in these tools is the pipe %>% which provides a means to write code from left to right instead of from the inside out (or to make a bunch of intermediary objects in the environment). Although the pipe can be used with base R its popularity came with dplyr.

Here are several examples on mtcars data set. The key functions are do and map which are quite versatile. I suggest running ?do and ?map.

library(tidyverse)

mtcars %>%
  group_by(cyl) %>%
  summarize(cor = cor(mpg, disp))
#output
# A tibble: 3 x 2
    cyl correlation
  <dbl>       <dbl>
1     4  -0.8052361
2     6   0.1030827
3     8  -0.5197670

another way is:

mtcars  %>% 
  group_by(cyl) %>%
  do(cor = cor(.$mpg, .$disp)) %>%
  unnest()

or for more variables:

mtcars  %>% 
  group_by(cyl) %>%
  do(cor = as.data.frame(cor(.[,-2]) )) %>%
  unnest() 

an example with cor.test:

library(broom)

mtcars  %>% 
  group_by(cyl)  %>% 
  do(tidy(cor.test(.$mpg, .$disp))) 
#output
    cyl   estimate  statistic     p.value parameter   conf.low   conf.high                               method alternative
  <dbl>      <dbl>      <dbl>       <dbl>     <int>      <dbl>       <dbl>                               <fctr>      <fctr>
1     4 -0.8052361 -4.0740206 0.002782827         9 -0.9474526 -0.39724826 Pearson's product-moment correlation   two.sided
2     6  0.1030827  0.2317344 0.825929685         5 -0.7046776  0.79446840 Pearson's product-moment correlation   two.sided
3     8 -0.5197670 -2.1075838 0.056774876        12 -0.8232990  0.01492976 Pearson's product-moment correlation   two.sided

and yet another way using purrr::map:

mtcars  %>% 
  split(.$cyl)  %>% 
  map(~cor.test(.x$mpg, .x$disp))

which gives a list, which can be manipulated with the same or another map function:

mtcars  %>% 
  split(.$cyl)  %>% 
  map(~cor.test(.x$mpg, .x$disp)) %>%
  map_dbl("p.value")
#output:
          4           6           8 
0.002782827 0.825929685 0.056774876 

to extract the coefficients:

mtcars  %>% 
  split(.$cyl)  %>% 
  map(~cor.test(.x$mpg, .x$disp)) %>%
  map(~data.frame(cor = .x$estimate, p = .x$p.value)) #check also `map_dfr` and `map_dfc`

#output
$`4`
           cor           p
cor -0.8052361 0.002782827

$`6`
          cor         p
cor 0.1030827 0.8259297

$`8`
          cor          p
cor -0.519767 0.05677488

UPDATE: answer to the updated question:

The problem is how you specify the do call. This is correct:

df %>%
  group_by(State) %>%
  do(tidy(cor.test(.$NumVis, .$AvgIncome, method="pearson")))

where the . represents the data passed by the previous pipe. In the example posted this results in:

Error in cor.test.default(.$NumVis, .$AvgIncome, method = "pearson") : 
not enough finite observations

which is reasonable considering only 1 observation per group is suplied

what you did is:

CorrDat<-read.csv("File")
     CorrDat %>%
       group_by(State) %>%
        do(tidy(cor.test(CorrDat$NumVis, CorrDat$Income, method="pearson")))

passing the whole CorrDat set to the do function so it performed the same operation as many times as there are groups.

The %>% pipe assumes the data passed will be used as the first argument in the following function, if it should not the data can be refereed to as .. You can perform operations like .$column or .[,2] etc.

Upvotes: 6

erocoar
erocoar

Reputation: 5893

With base r, you could use by.

E.g., replicating one of the examples in missuse's post:

do.call(rbind,
        by(mtcars, mtcars$cyl, FUN = function(x) cor.test(x$mpg, x$disp, data = x)))

statistic parameter p.value     estimate   null.value alternative method                                 data.name          conf.int 
4 -4.074021 9         0.002782827 -0.8052361 0          "two.sided" "Pearson's product-moment correlation" "x$mpg and x$disp" Numeric,2
6 0.2317344 5         0.8259297   0.1030827  0          "two.sided" "Pearson's product-moment correlation" "x$mpg and x$disp" Numeric,2
8 -2.107584 12        0.05677488  -0.519767  0          "two.sided" "Pearson's product-moment correlation" "x$mpg and x$disp" Numeric,2

Upvotes: 0

Related Questions