Reputation: 5059

Correlation of two samples with replicates

I have log2 expression values for 200 genes in two conditions, treated and untreated, with 20 replicates per condition. I want to calculate the correlation between the two conditions for each gene and rank the genes from highest to lowest correlation.

This is more of a biostatistics problem, but I think it is an important one, because many biologists and bio-programmers encounter it.

The dataset looks like this:

Gene    UT1            UT2            T1             T2  
DDR1     8.111795978    7.7606511867   7.9362235824   7.5974674936
RFC2    10.2418824097   9.7752152714  10.0085488406   9.5723427524
HSPA6    6.5850239731   6.7916563534   6.6883401632   7.3659252344
PAX8     9.2965160827   9.2031177653   9.249816924    8.667772504
GUCA1A   5.4828021059   5.3797749957   5.4312885508   5.1297319374

Only two replicates per condition are shown in this sample data.

I am looking for a solution in R or Python. The cor function in R does not give me what I want.

Upvotes: 2

Answers (3)

Scott Ritchie

Reputation: 10543

All sources I've read indicate that you need to create a summary measure across the replicates of each condition. I've seen both mean and median used, although you may want to look into more advanced pre-processing/normalization methods (like RMA). Once you've done that, you can calculate the correlation between untreated and treated.
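The summarize-then-correlate idea can be sketched in Python with NumPy. This is only an illustration using the five sample genes and two replicates from the question; the array names `untreated` and `treated` are assumptions, and the mean is used as the summary (the median would work the same way).

```python
import numpy as np

# Sample values from the question: rows are genes, columns are replicates.
genes = ["DDR1", "RFC2", "HSPA6", "PAX8", "GUCA1A"]
untreated = np.array([
    [8.111795978, 7.7606511867],
    [10.2418824097, 9.7752152714],
    [6.5850239731, 6.7916563534],
    [9.2965160827, 9.2031177653],
    [5.4828021059, 5.3797749957],
])
treated = np.array([
    [7.9362235824, 7.5974674936],
    [10.0085488406, 9.5723427524],
    [6.6883401632, 7.3659252344],
    [9.249816924, 8.667772504],
    [5.4312885508, 5.1297319374],
])

# Collapse the replicates into one summary value per gene and condition.
ut_mean = untreated.mean(axis=1)
t_mean = treated.mean(axis=1)

# Pearson correlation between the two condition profiles, taken across genes.
r = np.corrcoef(ut_mean, t_mean)[0, 1]
print(f"Pearson r between condition means: {r:.3f}")
```

Note that this yields a single correlation for the whole dataset rather than one value per gene, which is why it cannot be used directly for ranking genes.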

There is no way to calculate the correlation in the way that you're looking for. Any method that claims to do so will ultimately boil down to summarizing each condition with a single per-probe measure across its replicates (as above).

Alternatively, you could calculate the correlation between each pair of treated and untreated replicates for each probe, and take the average correlation.
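One possible reading of this alternative, sketched in Python: correlate every untreated replicate column with every treated replicate column across genes, then average the resulting coefficients. The interpretation (all replicate pairs, correlated across genes) and the names `untreated`, `treated`, and `pair_r` are assumptions made for illustration, using the sample values from the question.

```python
import numpy as np
from itertools import product

# Sample values from the question: rows are genes, columns are replicates.
untreated = np.array([
    [8.111795978, 7.7606511867],
    [10.2418824097, 9.7752152714],
    [6.5850239731, 6.7916563534],
    [9.2965160827, 9.2031177653],
    [5.4828021059, 5.3797749957],
])
treated = np.array([
    [7.9362235824, 7.5974674936],
    [10.0085488406, 9.5723427524],
    [6.6883401632, 7.3659252344],
    [9.249816924, 8.667772504],
    [5.4312885508, 5.1297319374],
])

# One Pearson coefficient per (untreated replicate, treated replicate) pair.
pair_r = [
    np.corrcoef(untreated[:, j], treated[:, k])[0, 1]
    for j, k in product(range(untreated.shape[1]), range(treated.shape[1]))
]

# Average over all replicate pairs.
mean_r = float(np.mean(pair_r))
print(f"{len(pair_r)} replicate pairs, mean Pearson r = {mean_r:.3f}")
```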

Upvotes: 1

Dr. No

Reputation: 156

Assuming that the first row holds the column names and the first column holds the row names, i.e., assuming that your data contains only numeric values, you can simply do the following in R. Because cor correlates the columns of its input, the data is transposed first so that the genes (rows) become columns; the result is an n x n matrix with all pairwise correlations between genes.

cor(t(data))
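For readers working in Python, a gene-by-gene correlation matrix can be sketched with NumPy. Unlike R's cor, `np.corrcoef` treats each row as a variable by default, so a genes-by-samples matrix needs no transpose. The variable name `data` and the use of the question's five sample genes are assumptions for illustration.

```python
import numpy as np

# Sample values from the question: rows are genes, columns are UT1, UT2, T1, T2.
data = np.array([
    [8.111795978, 7.7606511867, 7.9362235824, 7.5974674936],
    [10.2418824097, 9.7752152714, 10.0085488406, 9.5723427524],
    [6.5850239731, 6.7916563534, 6.6883401632, 7.3659252344],
    [9.2965160827, 9.2031177653, 9.249816924, 8.667772504],
    [5.4828021059, 5.3797749957, 5.4312885508, 5.1297319374],
])

# Rows are treated as variables (rowvar=True is the default),
# so this is an n_genes x n_genes matrix of pairwise Pearson correlations.
gene_corr = np.corrcoef(data)
print(gene_corr.shape)
```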

You may want to specify which type of correlation to use (cor accepts method = "pearson", "kendall", or "spearman"). What is the length of the time-series? There are whole studies developed to address the issue of selecting an appropriate measure, e.g., see:

Pablo A. Jaskowiak, Ricardo J. G. B. Campello, Ivan G. Costa Filho, "Proximity Measures for Clustering Gene Expression Microarray Data: A Validation Methodology and a Comparative Analysis," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 99, no. PrePrints, p. 1, 2013

Upvotes: 0

user1021713

Reputation: 2203

If I understand your question correctly, you need to calculate the correlation between UT1 and T1, and between UT2 and T2, for all the genes. There is a way to do it in R:

df <- data.frame(
  Gene = c("DDR1", "RFC2", "HSPA6", "PAX8", "GUCA1A"),
  UT1 = c(8.111796, 10.241882, 6.585024, 9.296516, 5.482802),
  UT2 = c(7.760651, 9.775215, 6.791656, 9.203118, 5.379775),
  T1 = c(7.936224, 10.008549, 6.688340, 9.249817, 5.431289),
  T2 = c(7.597467, 9.572343, 7.365925, 8.667773, 5.129732)
)

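Make a matrix of the columns to compare, with genes as rows:

mat1 <- cbind(df$UT1, df$T1)

Initialize a correlation matrix:

cor1 <- matrix(0, length(df$Gene), length(df$Gene))

Then calculate the correlation of every gene against every other gene. With only two columns per gene every correlation is exactly 1 or -1, so include all replicate columns in mat1 to get meaningful values:

# mat1 has no row names, so index its rows by position i rather than by gene name
for (i in seq_along(df$Gene)) cor1[i, ] <- apply(mat1, 1, function(x) cor(x, mat1[i, ]))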

I hope this helps.
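If the goal is one correlation per gene that can then be ranked, a row-wise Pearson correlation is one way to sketch it in Python. This assumes the replicates are paired (UT1 with T1, UT2 with T2, and so on), which the question does not state. The data below is synthetic, generated only to match the 200 genes by 20 replicates shape described in the question, and `rowwise_pearson` and the gene names are hypothetical helpers.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_reps = 200, 20

# Hypothetical gene names and synthetic log2 values standing in for real data.
gene_names = np.array([f"gene_{i}" for i in range(n_genes)])
untreated = rng.normal(8.0, 1.0, size=(n_genes, n_reps))
treated = untreated + rng.normal(0.0, 1.0, size=(n_genes, n_reps))


def rowwise_pearson(x, y):
    """Pearson correlation between matching rows of x and y."""
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean(axis=1, keepdims=True)
    num = (xc * yc).sum(axis=1)
    den = np.sqrt((xc ** 2).sum(axis=1) * (yc ** 2).sum(axis=1))
    return num / den


# One coefficient per gene, assuming replicate i is paired across conditions.
r = rowwise_pearson(untreated, treated)

# Rank genes from highest to lowest correlation.
order = np.argsort(r)[::-1]
ranked = list(zip(gene_names[order], r[order]))
for name, value in ranked[:5]:
    print(f"{name}\t{value:.3f}")
```

If the replicates are not paired, the per-gene coefficient depends on an arbitrary column order, and a summary-based approach is the safer choice.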

Upvotes: 1
