Joe Jewels
Joe Jewels

Reputation: 1

Basic Stats Question Regarding Coefficient Correlation and For Loops

Say I have two vectors that represent random variables:

x<-rnorm(100000)
y<-rexp(100000)

What would be the code to compute the correlation coefficient between the two vectors using for loops?

I am fairly new to R so a simpler answer would be better. Thank you.

Upvotes: 0

Views: 50

Answers (2)

Rui Barradas
Rui Barradas

Reputation: 76402

There is no reason why anyone would need the function below. It's simply the formula for Pearson correlation implemented in R's arithmetic operations.

cor2 <- function(x, y, na.rm = FALSE){
  if(na.rm){
    x <- x[!is.na(x)]
    y <- y[!is.na(y)]
  }
  stopifnot(length(x) == length(y))
  n <- length(x)
  sum.x <- 0
  sum.y <- 0
  sum.x2 <- 0
  sum.y2 <- 0
  sum.xy <- 0
  for(i in seq_along(x)){
    sum.x <- sum.x + x[i]
    sum.y <- sum.y + y[i]
    sum.x2 <- sum.x2 + x[i]^2
    sum.y2 <- sum.y2 + y[i]^2
    sum.xy <- sum.xy + x[i]*y[i]
  }
  numer <- n*sum.xy - sum.x*sum.y
  denom <- sqrt(n*sum.x2 - sum.x^2)*sqrt(n*sum.y2 - sum.y^2)
  numer/denom
}

set.seed(1234)

x <- rnorm(20)
y <- rexp(20)

cor(x, y)
#[1] -0.07445358
cor2(x, y)
#[1] -0.07445358

These results are not identical().

identical(cor(x, y),cor2(x, y))
#[1] FALSE
all.equal(cor(x, y),cor2(x, y))
#[1] TRUE
cor(x, y) - cor2(x, y)
#[1] -4.163336e-17

But the built-in function is much faster.

x <- rnorm(1000)
y <- rexp(1000)

microbenchmark::microbenchmark(
  base = cor(x, y),
  rui = cor2(x, y)
)
#Unit: microseconds
# expr     min      lq      mean  median       uq     max neval cld
# base  40.936  42.797  45.54924  43.482  44.8155 101.172   100  a 
#  rui 479.102 481.738 496.01586 483.190 491.5015 690.619   100   b

Upvotes: 2

dc37
dc37

Reputation: 16178

Hi Joe and welcome to SO !

You don't need a for loop for that, the cor.test can give you the correlation coefficient between two vectors.

x<-rnorm(1000) 
y<-rexp(1000)
cor.test(x,y)

And you get the following output:

> cor.test(x,y)

    Pearson's product-moment correlation

data:  x and y
t = 1.5191, df = 998, p-value = 0.1291
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.01400425  0.10969698
sample estimates:
       cor 
0.04803053 

You can also plot it using ggpubr on a scatter plot:

library(ggpubr)
df = data.frame(x,y)
ggscatter(df, x = "x", y = "y", 
          add = "reg.line", conf.int = TRUE, 
          cor.coef = TRUE, cor.method = "pearson")

And you get the following output:

enter image description here

Doing multiple correlation using or not a for loop

# generating a dataframe with multiple vector to compare:
df = NULL 
for(i in 1:5)
{
  df = data.frame(cbind(df,rnorm(1000)))
}

# Testing the correlation between columns 1 and all other columns using for loop
for(i in 1:ncol(df))
{
  if(i==1){correlation = cor(df[,1],df[,i])}
  else{correlation = c(correlation, cor(df[,1],df[,i]))}
}

> correlation
[1]  1.00000000 -0.05276680 -0.03968104 -0.02960876  0.01861618

# Using apply
correlation = as.vector(apply(df,2,function(x){cor(x,df[,1])}))

> correlation
[1]  1.00000000 -0.05276680 -0.03968104 -0.02960876  0.01861618

Upvotes: 2

Related Questions