user5481267

Reputation: 141

Creating a correlation matrix from a data frame in R

I have a data frame of correlations which looks something like this (although there are ~15,000 rows in my real data)

phen1<-c("A","B","C")
phen2<-c("B","C","A")
cors<-c(0.3,0.7,0.8)

data<-as.data.frame(cbind(phen1, phen2, cors))

    phen1  phen2   cors
1     A      B      0.3
2     B      C      0.7
3     C      A      0.8

This was created externally and read into R, and I want to convert this data frame into a correlation matrix with phen1 and phen2 as the labels for the rows and columns. I have only calculated the correlations for either the lower or the upper triangle, and I don't have the 1's for the diagonal. I would like the end result to be a full correlation matrix; I think a first step would be to create the lower/upper triangle and then convert that to a full matrix. I'm unsure how to do either step.

Also, the results may not be in an intuitive order. I'm not sure whether this matters, but ideally I would like a method that uses the labels in phen1 and phen2 to make sure each value ends up in the correct place in the matrix.

Essentially for this, I would want something like this as an end result:

  A    B    C
A 1    0.3  0.8
B 0.3  1    0.7
C 0.8  0.7  1

Upvotes: 7

Views: 2970

Answers (7)

Jeromy Anglim

Reputation: 34907

Here is a function that I wrote:

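long2cormat <- function(xlong, x = "x", y = "y", r = "r") {
    # Takes some inspiration from https://stackoverflow.com/a/57904948/180892
    # Keep only the two label columns and the correlation column.
    xlong <- xlong[, c(x, y, r)]
    names(xlong) <- c("x", "y", "r")

    # Coerce labels to character so factor input is indexed by name, not by code.
    xlong$x <- as.character(xlong$x)
    xlong$y <- as.character(xlong$y)

    # Mirror every pair so that both triangles get filled.
    data1 <- data.frame(x = xlong$x, y = xlong$y, r = xlong$r, stringsAsFactors = FALSE)
    data2 <- data.frame(x = xlong$y, y = xlong$x, r = xlong$r, stringsAsFactors = FALSE)
    df <- rbind(data1, data2)

    # Start from an all-NA square matrix named by the unique labels.
    uv <- unique(c(df$x, df$y))
    df1 <- matrix(NA, nrow = length(uv), ncol = length(uv), dimnames = list(uv, uv))
    for (i in seq_len(nrow(df))) df1[df$x[i], df$y[i]] <- df$r[i]
    diag(df1) <- 1
    df1
}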

To run it do the following:

xlong <- data.frame(phen1 = c("A","B","C"),
    phen2 = c("B","C","A"),
    cors = c(0.3,0.7,0.8))
long2cormat(xlong, "phen1", "phen2", "cors")

Importantly, for my own use cases, it leaves missing correlations as NA.
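
For larger inputs, the same idea can be written without the loop. This is only a sketch, not part of the function above: the name long2cormat_vec is hypothetical, and it assumes the labels can be treated as character strings. It relies on base R's indexing of a matrix by a two-column character matrix, which matches each row against the row and column names. Unlike the function above, it sorts the labels.

```r
# Hypothetical loop-free variant; the function name is an assumption of this sketch.
long2cormat_vec <- function(xlong, x = "x", y = "y", r = "r") {
    xs <- as.character(xlong[[x]])
    ys <- as.character(xlong[[y]])
    labs <- sort(unique(c(xs, ys)))

    # All-NA square matrix, so pairs absent from the input stay NA.
    m <- matrix(NA_real_, nrow = length(labs), ncol = length(labs),
                dimnames = list(labs, labs))

    # A two-column character matrix indexes by (row name, column name).
    m[cbind(xs, ys)] <- xlong[[r]]
    m[cbind(ys, xs)] <- xlong[[r]]   # mirror into the other triangle
    diag(m) <- 1
    m
}

xlong <- data.frame(phen1 = c("A", "B", "C"),
                    phen2 = c("B", "C", "A"),
                    cors = c(0.3, 0.7, 0.8))
res <- long2cormat_vec(xlong, "phen1", "phen2", "cors")
res
```

Because the values are placed by label, the order of the rows in the input does not matter.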

Upvotes: 0

Joe

Reputation: 8601

Plenty of solutions already, but I'll throw in another way. Note: I'm setting up the data so that cors is numeric rather than a factor in your original data frame.

data <- data.frame(phen1, phen2, cors)

Then we can expand the data frame with the mirrored combinations and use reshape2::acast() to convert the data to wide format.

library(tidyverse)
library(reshape2)

data %>% 
  select(phen1 = phen2, phen2 = phen1, cors) %>%
  bind_rows(data) %>%
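  acast(phen1 ~ phen2, value.var = "cors", fill = 1)

acast handily lets you fill in the missing values with some other specified value, in this case 1. Note that this fills every missing cell, so any off-diagonal pair that is absent from the input will also be set to 1.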

Also, check out the corrr package, which may be able to do this more neatly.

Upvotes: 0

Ronak Shah

Reputation: 388797

Here is another base R option: build a mirrored copy of data with phen1 and phen2 swapped, and stack it on the original so that every pair appears in both directions. Then use xtabs to get the correlation matrix and set the diagonal to 1.

data1 <- data.frame(phen1 = data$phen2, phen2 = data$phen1, cors = data$cors)  
df <- rbind(data, data1)
df1 <- as.data.frame.matrix(xtabs(cors ~ ., df))
diag(df1) <- 1
df1

#    A   B   C
#A 1.0 0.3 0.8
#B 0.3 1.0 0.7
#C 0.8 0.7 1.0

data

phen1<-c("A","B","C")
phen2<-c("B","C","A")
cors<-c(0.3,0.7,0.8)
data<- data.frame(phen1, phen2, cors)
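
One caveat with the xtabs approach: xtabs sums the values in each cell, so a pair that never appears in the input comes out as 0 rather than NA. The following is a minimal sketch of one way to mark such pairs as missing. The fourth label D and its correlation of 0.5 are made up purely for illustration, so that the pairs B-D and C-D are absent.

```r
# Illustrative data: label D and the value 0.5 are assumptions of this sketch.
data <- data.frame(phen1 = c("A", "B", "C", "D"),
                   phen2 = c("B", "C", "A", "A"),
                   cors  = c(0.3, 0.7, 0.8, 0.5))
mirrored <- data.frame(phen1 = data$phen2, phen2 = data$phen1, cors = data$cors)
df <- rbind(data, mirrored)

# Summed correlations per cell; unobserved pairs come out as 0.
m <- as.matrix(as.data.frame.matrix(xtabs(cors ~ phen1 + phen2, df)))
# Number of input rows per cell; 0 means the pair was never observed.
n <- as.matrix(as.data.frame.matrix(xtabs(~ phen1 + phen2, df)))

m[n == 0] <- NA   # mark unobserved pairs as missing
diag(m) <- 1      # a variable correlates perfectly with itself
m
```

The count table also makes duplicated pairs visible: any cell with a count above 1 has had its correlations added together.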

Upvotes: 4

Maurits Evers

Reputation: 50668

Here's another option.

First reshape data from long to wide and convert to a matrix. You have different options to do that (reshape2, tidyr, etc.); here I use tidyr::spread.

library(tidyverse)
mat <- data %>% spread(phen2, cors) %>% column_to_rownames("phen1") %>% as.matrix()

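We then fill each missing NA value from its mirror position in the transposed matrix, and set the diagonal to 1. Going through the transpose keeps the pairs aligned for any number of labels; pairing lower.tri() with upper.tri() element by element only lines up for a 3 x 3 matrix. This assumes the rows and columns are in the same label order.

# Copy each missing cell from its mirror position.
mat[is.na(mat)] <- t(mat)[is.na(mat)]
diag(mat) <- 1
mat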
#    A   B   C
#A 1.0 0.3 0.8
#B 0.3 1.0 0.7
#C 0.8 0.7 1.0

Upvotes: 1

Roland

Reputation: 132576

You can use the Matrix package for this. What you have is a sparse representation of the data and you want to turn this into a dense (redundant) matrix.

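data <- data.frame(phen1, phen2, cors)

# Map the labels to integer positions in one shared, sorted label set.
# match() works whether the label columns are factors or character vectors.
lev <- sort(unique(c(as.character(data$phen1), as.character(data$phen2))))
inds <- cbind(match(data$phen1, lev), match(data$phen2, lev))
# Put the smaller index first so that every pair lies in the upper triangle.
inds <- t(apply(inds, 1, sort))

library(Matrix)
res <- sparseMatrix(i = inds[,1], 
             j = inds[,2], 
             x = data$cors,
             dims = c(length(lev), length(lev)),
             symmetric = TRUE)
#3 x 3 sparse Matrix of class "dsCMatrix"
#
#[1,] .   0.3 0.8
#[2,] 0.3 .   0.7
#[3,] 0.8 0.7 . 

res <- as.matrix(res)
diag(res) <- 1
dimnames(res) <- list(lev, lev)
res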
#    A   B   C
#A 1.0 0.3 0.8
#B 0.3 1.0 0.7
#C 0.8 0.7 1.0

Upvotes: 3

tmfmnk

Reputation: 39858

There must be a more elegant way to do it, but here is a dplyr and tidyr possibility:

library(dplyr)
library(tidyr)

data %>%
 spread(phen1, cors) %>%
 rename(phen = "phen2") %>%
 bind_rows(data %>%
            spread(phen2, cors) %>%
            rename(phen = "phen1")) %>%
 group_by(phen) %>%
 summarise_all(~ ifelse(all(is.na(.)), 1, first(na.omit(.))))

  phen      A     B     C
  <chr> <dbl> <dbl> <dbl>
1 A       1     0.3   0.8
2 B       0.3   1     0.7
3 C       0.8   0.7   1  

Upvotes: 3

kashiff007

Reputation: 386

You can use the reshape package.

library(reshape)
data <- melt(data)
your_mat <- cast(data, phen1 ~ phen2 )

Output:

  phen1    A    B    C
1     A <NA>  0.3 <NA>
2     B <NA> <NA>  0.7
3     C  0.8 <NA> <NA>

You get NAs because many combinations are missing from your input table. To avoid this, you need an input table like this:

  phen1 phen2 cors
1     A     B  0.3
2     B     C  0.7
3     C     A  0.8
4     A     C  0.8
5     B     A  0.3
6     C     B  0.7
7     A     A  1.0
8     B     B  1.0
9     C     C  1.0
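
A table like that does not have to be typed by hand. As a rough base R sketch, assuming the three-row input from the question with numeric cors, the mirrored pairs and the diagonal can be appended and the result reshaped with tapply. The names mirrored, diagonal, full and wide are chosen only for this example.

```r
data <- data.frame(phen1 = c("A", "B", "C"),
                   phen2 = c("B", "C", "A"),
                   cors  = c(0.3, 0.7, 0.8))

# All labels that occur in either column.
labs <- sort(unique(c(as.character(data$phen1), as.character(data$phen2))))

# Each pair in the opposite direction, plus the diagonal entries.
mirrored <- data.frame(phen1 = data$phen2, phen2 = data$phen1, cors = data$cors)
diagonal <- data.frame(phen1 = labs, phen2 = labs, cors = 1)
full <- rbind(data, mirrored, diagonal)

# Long to wide: one cell per (phen1, phen2) pair; absent pairs would be NA.
wide <- tapply(full$cors, list(full$phen1, full$phen2), mean)
wide
```

Here full has the nine rows shown above (in a different order), and wide is the 3 x 3 matrix the question asks for.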

Upvotes: 1
