How to find correlations in a dataset containing over 350 columns in R

Question

I have a dataset with ~360 measurement types as columns and 200 rows, each with a unique ID.

+-----+-------+--------+--------+---------+---------+---------+---+---------+
|     |  ID   |   M1   |   M2   |   M3    |   M4    |   M5    | … |  M360   |
+-----+-------+--------+--------+---------+---------+---------+---+---------+
| 1   | 6F0ZC | 0.068  | 0.0691 | 37.727  | 42.6139 | 41.7356 | … | 44.9293 |
| 2   | 6F0ZY | 0.0641 | 0.0661 | 37.2551 | 43.2009 | 40.8979 | … | 45.7524 |
| 3   | 6F106 | 0.0661 | 0.0676 | 36.9686 | 42.9519 | 41.262  | … | 45.7038 |
| 4   | 6F108 | 0.0685 | 0.069  | 38.3026 | 43.5699 | 42.3    | … | 46.1701 |
| 5   | 6F10A | 0.0657 | 0.0668 | 37.8442 | 43.2453 | 41.7191 | … | 45.7597 |
| 6   | 6F19W | 0.0682 | 0.071  | 38.6493 | 42.4611 | 42.2224 | … | 45.3165 |
| 7   | 6F1A0 | 0.0681 | 0.069  | 39.3956 | 44.2963 | 44.1344 | … | 46.5918 |
| 8   | 6F1A6 | 0.0662 | 0.0666 | 38.5942 | 42.6359 | 42.2369 | … | 45.4439 |
| .   | .     | .      | .      | .       | .       | .       | . | .       |
| .   | .     | .      | .      | .       | .       | .       | . | .       |
| .   | .     | .      | .      | .       | .       | .       | . | .       |
| 199 | 6F1AA | 0.0665 | 0.0672 | 40.438  | 44.9896 | 44.9409 | … | 47.5938 |
| 200 | 6F1AC | 0.0659 | 0.0681 | 39.528  | 44.606  | 43.2454 | … | 46.4338 |
+-----+-------+--------+--------+---------+---------+---------+---+---------+

I am trying to find correlations among these measurements, identify highly correlated features, and visualize them. With so many columns, the usual correlation plots (chart.Correlation, corrgram, etc.) are not practical.

I also tried qgraph, but the measurements end up cluttered in one place and the result is not very intuitive.

library(qgraph)
qgraph(cor(df[-c(1)], use="pairwise"), 
       layout="spring",
       label.cex=0.9,  
       minimum = 0.90,
       label.scale=FALSE)

Is there a good way to visualize this and tell how these measurements are correlated with each other?

jlhoward · Accepted Answer

As mentioned in a comment, corrplot(...) might be a good option. Here is a ggplot option that does something similar. The basic idea is to draw a heat map, where color represents the correlation coefficient.

# create artificial dataset - you have this already
set.seed(1)   # for reproducible example
df <- matrix(rnorm(180*100),nrow=100)
df <- do.call(cbind,lapply(1:180,function(i)cbind(df[,i],2*df[,i])))

# you start here
library(ggplot2)
library(reshape2)
cor.df <- as.data.frame(cor(df))
cor.df$x <- factor(rownames(cor.df), levels=rownames(cor.df))
gg.df <- melt(cor.df,id="x",variable.name="y", value.name="cor")
# tiles colored continuously based on correlation coefficient
ggplot(gg.df, aes(x,y,fill=cor))+
  geom_tile()+
  scale_fill_gradientn(colours=rev(heat.colors(10)))+
  coord_fixed()

# tiles colored by binned correlation coefficient (6 equal-width intervals)
gg.df$level <- cut(gg.df$cor,breaks=6)
ggplot(gg.df, aes(x,y,fill=level))+
  geom_tile()+
  # one color per interval, so the scale never runs short of values
  scale_fill_manual(values=rev(heat.colors(nlevels(gg.df$level))))+
  coord_fixed()

Note the diagonal. This is by design - the contrived data is set up so that columns i and i+1 are perfectly correlated for every odd i, because each random column is followed by a copy of itself scaled by 2.
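A heat map shows the overall structure, but with ~360 columns it can be hard to read off which specific pairs are highly correlated. One possible complement, sketched here in base R, is to list the pairs whose absolute correlation exceeds a cutoff. The helper name `high_cor_pairs`, the fallback column names `V1, V2, ...`, and the 0.9 threshold are assumptions for illustration, not part of the original answer.

```r
# Hypothetical helper: return the column pairs with |correlation| >= threshold
high_cor_pairs <- function(data, threshold = 0.9) {
  cm <- cor(data, use = "pairwise.complete.obs")
  # give unnamed matrices placeholder column names (assumed naming scheme)
  if (is.null(colnames(cm))) {
    colnames(cm) <- rownames(cm) <- paste0("V", seq_len(ncol(cm)))
  }
  # look only at the upper triangle so each pair is reported once
  idx <- which(upper.tri(cm) & !is.na(cm) & abs(cm) >= threshold,
               arr.ind = TRUE)
  out <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    cor  = cm[idx],
    stringsAsFactors = FALSE
  )
  # strongest correlations first
  out[order(-abs(out$cor)), ]
}

# same contrived data as above: each random column is followed by 2x itself
set.seed(1)
m <- matrix(rnorm(180 * 100), nrow = 100)
m <- do.call(cbind, lapply(1:180, function(i) cbind(m[, i], 2 * m[, i])))

hc <- high_cor_pairs(m, threshold = 0.9)
nrow(hc)   # number of pairs above the cutoff
head(hc)
```

On the contrived data this should pick out the 180 constructed pairs (V1/V2, V3/V4, and so on). For the data frame in the question, something like `high_cor_pairs(df[-1])` would drop the ID column first; the right threshold depends on the measurements.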
