ML33M

Reputation: 415

R: Correlation matrix between multiple rows (objects) over multiple columns (variables)

I'm working with a data frame of multiple rows (objects) and multiple columns (variables), and I want to see whether any rows (objects) are correlated. From reading about cor(), it seems that for a single variable I can transpose my data frame and feed it into cor(). But how do I deal with multiple variables for each observation/object? The end goal is to plot the correlation matrix as a heatmap to eyeball interesting objects.

Example data:

Treatment <- c('Drug A','Drug B','Drug C','Drug D','Drug E','Drug F')
Measurment_V1 <- runif(6, 0, 3000)
Measurment_V2 <- runif(6, 0, 20)
Measurment_V3 <- runif(6, 0, 1)
Measurment_V4 <- runif(6, 0, 120000)
Measurment_V5 <- runif(6, 0, 100)

df<- as.data.frame(cbind(Treatment,Measurment_V1,Measurment_V2,Measurment_V3,Measurment_V4,Measurment_V5))

Each drug is described by measurements V1-V5 (in reality there are a few hundred columns). How can I get a correlation matrix between all the drugs (Drug A to Drug F) and then plot their correlations on a heatmap, as the Hmisc library could do?

Upvotes: 1

Views: 1655

Answers (2)

Werner Hertzog

Reputation: 2022

This might do it:

# Redo your data frame
df <- data.frame(Treatment,Measurment_V1,Measurment_V2,Measurment_V3,Measurment_V4,Measurment_V5)

# Transpose numeric columns
dft <- as.data.frame(t(df[,2:6]))

# Rename vars
names(dft) <- c("Drug_A","Drug_B","Drug_C","Drug_D","Drug_E","Drug_F")

# Correlation matrix
cor(dft)


Output:
          Drug_A    Drug_B    Drug_C    Drug_D    Drug_E    Drug_F
Drug_A 1.0000000 0.9995697 0.9999240 0.9999939 0.9998902 0.9999665
Drug_B 0.9995697 1.0000000 0.9998554 0.9994612 0.9998946 0.9997758
Drug_C 0.9999240 0.9998554 1.0000000 0.9998748 0.9999969 0.9999911
Drug_D 0.9999939 0.9994612 0.9998748 1.0000000 0.9998324 0.9999320
Drug_E 0.9998902 0.9998946 0.9999969 0.9998324 1.0000000 0.9999777
Drug_F 0.9999665 0.9997758 0.9999911 0.9999320 0.9999777 1.0000000

You can then use the above correlation matrix to plot a heatmap.
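As a minimal sketch of that last step, base R's heatmap() can draw a drug-by-drug correlation matrix directly. The block below rebuilds illustrative data of the same shape as in the question so that it runs on its own; the seed, the matrix name `m`, and the colour ramp are assumptions made for this example, not part of the original answer.

```r
# Illustrative data with the same shape as in the question (seed is an assumption)
set.seed(1)
Treatment <- c("Drug A", "Drug B", "Drug C", "Drug D", "Drug E", "Drug F")
m <- cbind(
  Measurment_V1 = runif(6, 0, 3000),
  Measurment_V2 = runif(6, 0, 20),
  Measurment_V3 = runif(6, 0, 1),
  Measurment_V4 = runif(6, 0, 120000),
  Measurment_V5 = runif(6, 0, 100)
)
rownames(m) <- Treatment

# Transpose so that drugs become columns, then correlate drug against drug
cm <- cor(t(m))

# symm = TRUE treats the matrix as symmetric; scale = "none" keeps the raw correlations
heatmap(cm, symm = TRUE, scale = "none",
        col = colorRampPalette(c("blue", "white", "red"))(50),
        margins = c(7, 7))
```

With raw measurements on such different scales, the largest-valued column tends to dominate, so most cells may end up very close to 1, as in the output above.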

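Note that I rebuilt your data frame with data.frame(), which keeps the measurement columns numeric. as.data.frame(cbind(...)) does not, because cbind() first builds a single-type matrix, and mixing the Treatment strings with numbers coerces every column to character.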
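The difference between the two constructors can be checked with a small sketch. The vectors `lab` and `x` here are hypothetical stand-ins for the Treatment and measurement columns.

```r
lab <- c("a", "b")    # stand-in for the Treatment column
x   <- c(1.5, 2.5)    # stand-in for a measurement column

# cbind() builds a matrix, which holds a single type, so the numbers become strings
bad <- as.data.frame(cbind(lab, x))
class(bad$x)    # not numeric (character in R >= 4.0, factor in older versions)

# data.frame() keeps each column's own type
good <- data.frame(lab, x)
class(good$x)   # numeric
```

cor() needs numeric input, which is why the first form fails or needs a conversion step before it can be used.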

Upvotes: 1

pseudospin

Reputation: 2777

I think you're actually looking at this problem the wrong way around. You should be treating the drugs as the variables and investigating the correlation structure of the measurements.

That is, the correlation matrix of interest is:

cor(cbind(Measurment_V1, Measurment_V2, Measurment_V3, Measurment_V4, Measurment_V5))

One approach is to do PCA on the measurements so that you can place the drugs in a standardised space.

Then you could look for clustering of the drugs in that space to see which are similar to each other. Note that it's much harder to cluster in the original measurement space because the measurements are on very different scales: you have to standardise them somehow, which PCA can do when it is run on scaled variables. It also reduces the dimensionality of the measurement space, which helps you visualise what is going on.
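A rough sketch of that workflow, using only base R (prcomp(), dist(), hclust()), might look like the following. The seed, the matrix name `m`, the choice of two components, and the default complete-linkage clustering are all assumptions made for illustration, not recommendations from the answer.

```r
# Illustrative data with the same shape as in the question (seed is an assumption)
set.seed(1)
Treatment <- c("Drug A", "Drug B", "Drug C", "Drug D", "Drug E", "Drug F")
m <- cbind(
  Measurment_V1 = runif(6, 0, 3000),
  Measurment_V2 = runif(6, 0, 20),
  Measurment_V3 = runif(6, 0, 1),
  Measurment_V4 = runif(6, 0, 120000),
  Measurment_V5 = runif(6, 0, 100)
)
rownames(m) <- Treatment

# PCA on centred, unit-variance measurements so no single scale dominates
pca <- prcomp(m, center = TRUE, scale. = TRUE)

# Coordinates of each drug on the first two principal components
scores <- pca$x[, 1:2]

# Scatter plot of the drugs in the reduced space, labelled by name
plot(scores, type = "n", xlab = "PC1", ylab = "PC2")
text(scores, labels = rownames(scores))

# Hierarchical clustering of the drugs based on distances in that space
hc <- hclust(dist(scores))
plot(hc, main = "Drugs clustered on PC1-PC2")
```

With a few hundred real measurement columns, summary(pca) shows how much variance each component explains, which can guide how many components to keep.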

Upvotes: 1
