Reputation: 1
Say I have two vectors that represent random variables:
x<-rnorm(100000)
y<-rexp(100000)
What would be the code to compute the correlation coefficient between the two vectors using for
loops?
I am fairly new to R so a simpler answer would be better. Thank you.
Upvotes: 0
Views: 50
Reputation: 76402
There is no reason why anyone would need the function below. It's simply the formula for Pearson correlation implemented in R's arithmetic operations.
cor2 <- function(x, y, na.rm = FALSE){
if(na.rm){
x <- x[!is.na(x)]
y <- y[!is.na(y)]
}
stopifnot(length(x) == length(y))
n <- length(x)
sum.x <- 0
sum.y <- 0
sum.x2 <- 0
sum.y2 <- 0
sum.xy <- 0
for(i in seq_along(x)){
sum.x <- sum.x + x[i]
sum.y <- sum.y + y[i]
sum.x2 <- sum.x2 + x[i]^2
sum.y2 <- sum.y2 + y[i]^2
sum.xy <- sum.xy + x[i]*y[i]
}
numer <- n*sum.xy - sum.x*sum.y
denom <- sqrt(n*sum.x2 - sum.x^2)*sqrt(n*sum.y2 - sum.y^2)
numer/denom
}
set.seed(1234)
x <- rnorm(20)
y <- rexp(20)
cor(x, y)
#[1] -0.07445358
cor2(x, y)
#[1] -0.07445358
These results are not identical()
.
identical(cor(x, y),cor2(x, y))
#[1] FALSE
all.equal(cor(x, y),cor2(x, y))
#[1] TRUE
cor(x, y) - cor2(x, y)
#[1] -4.163336e-17
But the built-in function is much faster.
x <- rnorm(1000)
y <- rexp(1000)
microbenchmark::microbenchmark(
base = cor(x, y),
rui = cor2(x, y)
)
#Unit: microseconds
# expr min lq mean median uq max neval cld
# base 40.936 42.797 45.54924 43.482 44.8155 101.172 100 a
# rui 479.102 481.738 496.01586 483.190 491.5015 690.619 100 b
Upvotes: 2
Reputation: 16178
Hi Joe and welcome to SO !
You don't need a for
loop for that, the cor.test
can give you the correlation coefficient between two vectors.
x<-rnorm(1000)
y<-rexp(1000)
cor.test(x,y)
And you get the following output:
> cor.test(x,y)
Pearson's product-moment correlation
data: x and y
t = 1.5191, df = 998, p-value = 0.1291
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.01400425 0.10969698
sample estimates:
cor
0.04803053
You can also plot it using ggpubr
on a scatter plot:
library(ggpubr)
df = data.frame(x,y)
ggscatter(df, x = "x", y = "y",
add = "reg.line", conf.int = TRUE,
cor.coef = TRUE, cor.method = "pearson")
And you get the following output:
Doing multiple correlation using or not a for
loop
# generating a dataframe with multiple vector to compare:
df = NULL
for(i in 1:5)
{
df = data.frame(cbind(df,rnorm(1000)))
}
# Testing the correlation between columns 1 and all other columns using for loop
for(i in 1:ncol(df))
{
if(i==1){correlation = cor(df[,1],df[,i])}
else{correlation = c(correlation, cor(df[,1],df[,i]))}
}
> correlation
[1] 1.00000000 -0.05276680 -0.03968104 -0.02960876 0.01861618
# Using apply
correlation = as.vector(apply(df,2,function(x){cor(x,df[,1])}))
> correlation
[1] 1.00000000 -0.05276680 -0.03968104 -0.02960876 0.01861618
Upvotes: 2