dorien

Reputation: 5407

Finding non-linear correlations in R

I have about 90 variables stored in data[2-90]. I suspect that about 4 of them have a parabola-like correlation with data[1], and I want to identify which ones. Is there an easy and quick way to do this?

I have tried building a model like this (which I could do in a loop for each variable i = 2:90):

y <- data$AvgRating
x <- data$Hamming.distance
x2 <- x^2   # quadratic term (equivalently, use I(x^2) directly in the formula)

quadratic.model <- lm(y ~ x + x2)

I then look at the R^2 and the coefficient on the quadratic term to get an idea of the correlation. Is there a better way of doing this?
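For reference, the full loop I have in mind is roughly this (just a sketch, using the same column layout described above, with the response in data[, 1] and the candidates in data[, 2:90]):

# fit y ~ x + x^2 for each candidate column and collect the R^2
y <- data[[1]]
r2 <- sapply(2:90, function(i) {
  x <- data[[i]]
  summary(lm(y ~ x + I(x^2)))$r.squared
})
names(r2) <- names(data)[2:90]
sort(r2, decreasing = TRUE)   # most parabola-like candidates first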

Maybe R could build a regression model with the 90 variables and choose the significant ones itself? Would that be in any way possible? I can do this in JMP for linear regression, but I'm not sure I could do non-linear regression with R on all the variables at once. That is why I was manually trying to see in advance which ones are correlated. It would be helpful if there were a function for that.
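Something like the following is what I imagine (only a rough sketch; step() does AIC-based stepwise selection rather than the significance-based selection I use in JMP, and the formula assumes the response is the first column of data):

# build a formula with a linear and a quadratic term for every predictor
preds <- names(data)[-1]
fml <- as.formula(paste(names(data)[1], "~",
                        paste(sprintf("%s + I(%s^2)", preds, preds), collapse = " + ")))
full.model <- lm(fml, data = data)

# let stepwise selection drop the uninformative terms
reduced.model <- step(full.model, trace = 0)
summary(reduced.model)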

Upvotes: 8

Views: 10590

Answers (3)

vahab najari

Reputation: 151

You can use the nlcor package in R. This package finds the nonlinear correlation between two data vectors. There are other approaches to estimating nonlinear association, such as the infotheo package, but a nonlinear relationship between two variables can take almost any shape.

nlcor is robust to most nonlinear shapes. It works pretty well in different scenarios.

At a high level, nlcor works by adaptively segmenting the data into linearly correlated segments and aggregating the segment correlations into an overall nonlinear correlation. The output is a number between 0 and 1, with values close to 1 indicating a strong correlation. Unlike Pearson correlation, negative values are not returned, because a sign has no meaning for nonlinear relationships.

More details about this package can be found here.

To install nlcor, follow these steps:

# nlcor is distributed on GitHub, so it is installed via devtools
install.packages("devtools")
library(devtools)
install_github("ProcessMiner/nlcor")
library(nlcor)

After installing it, you can use it as follows:

# Implementation 
x <- seq(0,3*pi,length.out=100)
y <- sin(x)
plot(x,y,type="l")

[Plot: y = sin(x)]

# linear correlation is small
cor(x,y)
# [1] 6.488616e-17
# nonlinear correlation is more representative
nlcor(x,y, plt = T)
# $cor.estimate
# [1] 0.9774
# $adjusted.p.value
# [1] 1.586302e-09
# $cor.plot

[Plot: nlcor output for sin(x)]

As shown in the example, the linear correlation was close to zero even though there was a clear relationship between the variables, which nlcor was able to detect.

Note: the order of x and y inside nlcor matters; nlcor(x, y) is different from nlcor(y, x). Here x and y represent the independent and dependent variables, respectively.
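To screen all of the predictors in the question at once, a rough sketch would be the following (assuming the response is the first column of data and the candidates are columns 2 to 90):

# nonlinear correlation of each candidate predictor with the response
nl <- sapply(2:90, function(i) nlcor(data[[i]], data[[1]])$cor.estimate)
names(nl) <- names(data)[2:90]
sort(nl, decreasing = TRUE)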

Upvotes: 9

Keith Hughitt

Reputation: 4970

Another option would be to compute the mutual information score between each pair of variables. For example, using the mutinformation function from the infotheo package, you could do:

set.seed(1)

library(infotheo)

# correlated vars (x & y correlated, z is noise)
x <- seq(-10,10, by=0.5)
y <- x^2
z <- rnorm(length(x))

# list of vectors
raw_dat <- list(x, y, z)


# combine into a matrix, then discretize for mutual information
dat <- matrix(unlist(raw_dat), ncol=length(raw_dat))
dat <- discretize(dat)

mutinformation(dat)

Result:

|   |        V1|        V2|        V3|
|:--|---------:|---------:|---------:|
|V1 | 1.0980124| 0.4809822| 0.0553146|
|V2 | 0.4809822| 1.0943907| 0.0413265|
|V3 | 0.0553146| 0.0413265| 1.0980124|

By default, mutinformation() computes the empirical mutual information between two or more discrete variables. The discretize() step is needed when working with continuous data, since it transforms the data into discrete values.

This might be helpful, at least as a first stab at looking for nonlinear relationships between variables such as the one described in the question.
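Applied to the data in the question, a rough sketch (assuming the response is the first column of data) would be:

# discretize all 90 columns and compute the pairwise mutual information,
# then rank the predictors by their mutual information with column 1
mi <- mutinformation(discretize(data))
sort(mi[1, -1], decreasing = TRUE)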

Upvotes: 1

Yorgos

Reputation: 30485

Fitting a generalized additive model will help you identify curvature in the relationship between the explanatory variables and the response. Read the example on page 22 here.
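For instance, a minimal sketch with the mgcv package (one possible GAM implementation, which may differ from the one in the linked example; the column names are the two mentioned in the question, and the remaining predictors would be handled the same way):

library(mgcv)

# a smooth term lets the data determine the shape of the relationship
fit <- gam(AvgRating ~ s(Hamming.distance), data = data)
summary(fit)   # an edf well above 1 for s(Hamming.distance) suggests curvature
plot(fit)      # visualize the estimated smooth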

Upvotes: 2
