Reputation: 5059
I have log2 expression values for 200 genes in two conditions, treated and untreated, with 20 replicates per condition. For each gene, I want to calculate the correlation between the two conditions and then rank the genes from highest to lowest correlation.
This is more of a biostatistics problem, but I think it is an important one for biologists and bio-programmers, as many of us encounter it.
The dataset looks like this:
Gene UT1 UT2 T1 T2
DDR1 8.111795978 7.7606511867 7.9362235824 7.5974674936
RFC2 10.2418824097 9.7752152714 10.0085488406 9.5723427524
HSPA6 6.5850239731 6.7916563534 6.6883401632 7.3659252344
PAX8 9.2965160827 9.2031177653 9.249816924 8.667772504
GUCA1A 5.4828021059 5.3797749957 5.4312885508 5.1297319374
I have shown only two replicates per condition in the sample data.
I am looking for a solution in R or Python. The cor function in R does not give me what I want.
Upvotes: 2
Views: 2991
Reputation: 10543
All sources I've read indicate that you need to create an average measure across the replicates of each condition. I've seen both mean and median used, although you may want to look into more advanced pre-processing/normalization methods (like RMA). Once you've done that, you can calculate the correlation between untreated and treated.
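As a rough illustration of the summarise-then-correlate idea, here is a minimal Python sketch using only the five sample rows from the question. The names `expr`, `pearson` and `condition_correlation` are my own, not from any library, and RMA-style normalization is not attempted here.

```python
from math import sqrt
from statistics import mean, median

# Sample rows from the question: gene -> (untreated replicates, treated replicates).
expr = {
    "DDR1":   ([8.111795978, 7.7606511867], [7.9362235824, 7.5974674936]),
    "RFC2":   ([10.2418824097, 9.7752152714], [10.0085488406, 9.5723427524]),
    "HSPA6":  ([6.5850239731, 6.7916563534], [6.6883401632, 7.3659252344]),
    "PAX8":   ([9.2965160827, 9.2031177653], [9.249816924, 8.667772504]),
    "GUCA1A": ([5.4828021059, 5.3797749957], [5.4312885508, 5.1297319374]),
}

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def condition_correlation(expr, summarise=mean):
    """Collapse the replicates of each gene, then correlate the two conditions across genes."""
    untreated = [summarise(ut) for ut, _ in expr.values()]
    treated = [summarise(tr) for _, tr in expr.values()]
    return pearson(untreated, treated)

r_mean = condition_correlation(expr, mean)
r_median = condition_correlation(expr, median)
print(f"mean-based r = {r_mean:.4f}, median-based r = {r_median:.4f}")
```

Note that this gives a single correlation between the two conditions across all genes, not one value per gene.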
There is no way to calculate correlation in the way that you're looking for. Any method that would do so will ultimately boil down to summarizing the information across the two conditions through getting a summary probe measure across the replicates (as above).
Alternatively you could do something like calculate the correlation between each treated and untreated replicate for each probe, and take the average correlation.
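One way to read this per-probe alternative is sketched below in Python: correlate the untreated and treated replicate vectors of each gene, then sort. This assumes the replicates are paired (UT1 with T1, UT2 with T2, and so on), which the question does not state. The gene names and values in `toy` are made up purely for illustration, because two replicates per condition (as in the sample data) would always give a correlation of +1 or -1.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def rank_genes(expr):
    """Return (gene, r) pairs sorted from highest to lowest correlation."""
    scores = [(gene, pearson(ut, tr)) for gene, (ut, tr) in expr.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy data: gene -> (untreated replicates, treated replicates), paired by position.
toy = {
    "geneA": ([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]),
    "geneB": ([1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]),
    "geneC": ([1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 4.0]),
}

ranked = rank_genes(toy)
for gene, r in ranked:
    print(gene, round(r, 3))
```

If the replicates are not paired, the order within each vector is arbitrary and this per-gene correlation is not meaningful.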
Upvotes: 1
Reputation: 156
Assuming that the first row holds the column names and the first column holds the gene names (set as row names), i.e., assuming that your data contains only numeric values, you can do the following in R, which will give you an n x n matrix with all pairwise correlations between genes. Note that cor() correlates columns, so with genes in rows the data has to be transposed first:
cor(t(data))
You may want to specify what type of correlation to use (cor() takes a method argument: "pearson", "kendall" or "spearman"). What is the length of the time-series? There are whole studies devoted to selecting an appropriate measure, e.g., see:
Pablo A. Jaskowiak, Ricardo J. G. B. Campello, Ivan G. Costa Filho, "Proximity Measures for Clustering Gene Expression Microarray Data: A Validation Methodology and a Comparative Analysis," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 99, no. PrePrints, p. 1, 2013
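To show how the choice of measure changes the result, here is a small Python sketch that builds the gene-by-gene matrix with either Pearson or Spearman correlation, treating Spearman as Pearson applied to average ranks. The helper names (`ranks`, `spearman`, `correlation_matrix`) are my own, and the input is just the five sample rows from the question.

```python
from math import sqrt

# Sample rows from the question: gene -> [UT1, UT2, T1, T2].
rows = {
    "DDR1":   [8.111795978, 7.7606511867, 7.9362235824, 7.5974674936],
    "RFC2":   [10.2418824097, 9.7752152714, 10.0085488406, 9.5723427524],
    "HSPA6":  [6.5850239731, 6.7916563534, 6.6883401632, 7.3659252344],
    "PAX8":   [9.2965160827, 9.2031177653, 9.249816924, 8.667772504],
    "GUCA1A": [5.4828021059, 5.3797749957, 5.4312885508, 5.1297319374],
}

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))

def correlation_matrix(rows, method=pearson):
    """Nested dict m[g][h] with the correlation between genes g and h."""
    genes = list(rows)
    return {g: {h: method(rows[g], rows[h]) for h in genes} for g in genes}

pearson_matrix = correlation_matrix(rows, pearson)
spearman_matrix = correlation_matrix(rows, spearman)
print(round(pearson_matrix["DDR1"]["HSPA6"], 3), round(spearman_matrix["DDR1"]["HSPA6"], 3))
```

With only four values per gene, rank-based correlations can take very few distinct values, so the two measures only become comparable with the full set of replicates.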
Upvotes: 0
Reputation: 2203
If I understand your question correctly, you need to calculate the correlation between UT1 and T1, and between UT2 and T2, for all the genes. There is a way to do it in R:
df <- data.frame(
  Gene = c("DDR1", "RFC2", "HSPA6", "PAX8", "GUCA1A"),
  UT1 = c(8.111796, 10.241882, 6.585024, 9.296516, 5.482802),
  UT2 = c(7.760651, 9.775215, 6.791656, 9.203118, 5.379775),
  T1 = c(7.936224, 10.008549, 6.688340, 9.249817, 5.431289),
  T2 = c(7.597467, 9.572343, 7.365925, 8.667773, 5.129732)
)
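Make a matrix like this (one row per gene):
mat1 <- cbind(df$UT1, df$T1)
Initialize a correlation matrix:
cor1 <- matrix(0, length(df$Gene), length(df$Gene))
Then calculate the correlation of every gene against every other gene like this:
for (i in seq_along(df$Gene)) cor1[i, ] <- apply(mat1, 1, function(x) cor(x, mat1[i, ]))
With only two values per gene, as in this sample, every correlation is trivially +1 or -1, so put all of your replicate columns into mat1 to get meaningful values.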
I hope this helps.
Upvotes: 1