Reputation: 173
I have a data table like below.
data table1:
Cty  LR1998  LR1999 ... LR2018  SM1998  SM1999 ... SM2018
1    0.1     2.5    ... 5.2     14.52   58.62  ... 69.2
7    0       50.2   ... 10.6    25.5    80.2   ... 80.5
...
I need to find the Pearson correlation for each "Cty" over the period 1998 to 2018, using the LR and SM variables for that period.
I would also like to create a nice correlation plot.
Can somebody guide me on how to perform this in R? I know how to use the "cor" function, but not how to apply it to a large data table.
Example Output:
Cty  Cor   p-value
1    0.16  0.125
7    0.32  <0.05
Thanks a lot.
Upvotes: 0
Views: 965
Reputation: 160607
To get the correlation between countries, it might help to reshape the data. Though there's a base reshape function, I've always found it confusing, so I often use tidyr (or data.table) for reshaping operations.
First, some data:
set.seed(2)
n <- 6
x <- cbind.data.frame(
  CTY = LETTERS[seq_len(n)],
  sapply(paste0("LR", 2000:2006), function(ign) runif(n)),
  stringsAsFactors = FALSE
)
Reshaping, so that each country is in its own column (replacing my unnecessarily-complicated tidyr method with the should-have-been-obvious t, thanks pooja p):
y <- `colnames<-`(t(x[-1]), x[[1]])
(The biggest difference from my now-removed gather/spread approach is that the result is still a matrix, trivially converted with as.data.frame if needed. That should not be necessary here.)
Now the correlation is rather straightforward:
cor(y)
# A B C D E F
# A 1.000 -0.435 -0.1443 -0.2010 0.736 -0.471
# B -0.435 1.000 -0.3825 -0.3493 -0.145 0.279
# C -0.144 -0.382 1.0000 0.0885 -0.426 0.447
# D -0.201 -0.349 0.0885 1.0000 -0.523 -0.128
# E 0.736 -0.145 -0.4261 -0.5232 1.000 -0.121
# F -0.471 0.279 0.4467 -0.1279 -0.121 1.000
With the data in this format, adapting it to cor.test takes a little more work. I'll lean on more of the tidyverse for this:
library(dplyr)
library(tidyr)    # for crossing()
library(purrr)    # for map2_dfc()

crossing(a = x$CTY, b = x$CTY) %>%
  rowwise() %>%
  do(bind_cols(.,
    map2_dfc(.$a, .$b,
             ~ as.data.frame(cor.test(y[, .x], y[, .y])[c("estimate", "p.value")]))
  )) %>%
  ungroup()
# # A tibble: 36 x 4
# a b estimate p.value
# * <chr> <chr> <dbl> <dbl>
# 1 A A 1 0
# 2 A B -0.435 0.329
# 3 A C -0.144 0.757
# 4 A D -0.201 0.666
# 5 A E 0.736 0.0591
# 6 A F -0.471 0.286
# 7 B A -0.435 0.329
# 8 B B 1 0
# 9 B C -0.382 0.397
# 10 B D -0.349 0.443
# # ... with 26 more rows
Since correlation is symmetric (the result for a,b is the same as for b,a), we can remove the duplicate pairs and run the tests with:
crossing(a = x$CTY, b = x$CTY) %>%
  transmute(a1 = pmin(a, b), b1 = pmax(a, b)) %>%
  distinct() %>%
  rowwise() %>%
  do(bind_cols(.,
    purrr::map2_dfc(.$a1, .$b1,
                    ~ as.data.frame(cor.test(y[, .x], y[, .y])[c("estimate", "p.value")]))
  )) %>%
  ungroup()
# # A tibble: 21 x 4 ...
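Since the question also asked for a correlation plot, one option (a sketch, assuming the corrplot package, which is not otherwise used here) is to hand the correlation matrix straight to corrplot:

```r
library(corrplot)

# Sketch: draw the country-by-country correlation matrix computed
# above; circles are scaled and colored by correlation strength.
corrplot(cor(y), method = "circle", type = "upper")
```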
Upvotes: 0
Reputation: 144
For the correlation plot, you can try this: transpose the data frame and use the pairs function.
pairs(t(table1[-1]))
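If you also want the correlation values printed in the plot, pairs accepts custom panel functions; the panel.cor helper below is hypothetical, adapted from the examples in ?pairs:

```r
# Hypothetical helper (adapted from ?pairs): print the pairwise
# correlation in the upper panels instead of a scatterplot.
panel.cor <- function(x, y, ...) {
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  text(0.5, 0.5, sprintf("%.2f", cor(x, y)))
}
pairs(t(table1[-1]), upper.panel = panel.cor)
```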
Upvotes: 1