Reputation: 112
I have two groups of patients which contains same dim for genes but different dim for patients (samples). Different samples in each group is a biological replication.
sample1 <- structure(c(3.990406045, 4.041745321, 4.002401404, 4.031463584,
4.046886189, 4.00582865, 3.985265177, 3.9869788, 3.995546913,
4.00582865, 11.75549075, 11.81394311, 11.81826206, 11.76013913,
11.8408451, 11.83619671, 11.72858876, 11.73755609, 11.78239274,
11.83619671, 8.647734791, 8.606480387, 8.64648886, 8.607548328,
8.605946416, 8.646132879, 8.648268762, 8.648090771, 8.647200821,
8.646132879, 5.359884744, 5.371302287, 5.37638989, 5.357155019,
5.378375921, 5.381105646, 5.35281111, 5.355168988, 5.366958378,
5.381105646, 8.805045323, 8.684889613, 8.794736874, 8.693725426,
8.680471706, 8.791791603, 8.80946323, 8.807990594, 8.800627416,
8.791791603, 10.87587031, 10.85539252, 10.87095037, 10.85960961,
10.85328398, 10.86954467, 10.87797885, 10.877276, 10.87376176,
10.86954467, 5.505422817, 5.530799682, 5.631682175, 5.422577376,
5.584910836, 5.667756277, 5.451311664, 5.469348715, 5.559533971,
5.667756277), .Dim = c(10L, 7L), .Dimnames = list(c("patient1",
"patient2", "patient3", "patient4",
"patient5", "patient6", "patient7",
"patient8", "patient9", "patient10"
), c("gene1", "gene2", "gene3", "gene4", "gene5", "gene6", "gene7"
)))
and
sample2 <- structure(c(3.990406045, 4.041745321, 4.002401404, 4.031463584,
4.046886189, 4.00582865, 3.985265177, 3.9869788, 11.75549075,
11.81394311, 11.81826206, 11.76013913, 11.8408451, 11.83619671,
11.72858876, 11.73755609, 8.647734791, 8.606480387, 8.64648886,
8.607548328, 8.605946416, 8.646132879, 8.648268762, 8.648090771,
5.359884744, 5.371302287, 5.37638989, 5.357155019, 5.378375921,
5.381105646, 5.35281111, 5.355168988, 8.805045323, 8.684889613,
8.794736874, 8.693725426, 8.680471706, 8.791791603, 8.80946323,
8.807990594, 10.87587031, 10.85539252, 10.87095037, 10.85960961,
10.85328398, 10.86954467, 10.87797885, 10.877276, 5.505422817,
5.530799682, 5.631682175, 5.422577376, 5.584910836, 5.667756277,
5.451311664, 5.469348715), .Dim = c(8L, 7L), .Dimnames = list(
c("patient1",
"patient2", "patient3", "patient4",
"patient5", "patient6", "patient7",
"patient8"), c("gene1", "gene2", "gene3", "gene4", "gene5", "gene6", "gene7")))
Now, I want to check the correlation between gene pairs between two groups
rcorr(sample1, sample2, type="s")#spearman
I reveive:
Error in cbind(x, y) : number of rows of matrices must match (see arg 2)
but for correlation of patients t(samples) brings the correlation between patient pairs. I need correlations between gene pairs (below). Is there anything wrong? Should I consider some statistical points?
when length of patients are equal I reverie this:
> rcorr(sample1[1:8, ], sample2[1:8,], type="s")
gene1 gene2 gene3 gene4 gene5 gene6 gene7 gene1 gene2 gene3 gene4 gene5 gene6 gene7
gene1 1.00 0.81 -1.00 0.67 -1.00 -1.00 0.36 1.00 0.81 -1.00 0.67 -1.00 -1.00 0.36
gene2 0.81 1.00 -0.81 0.95 -0.81 -0.81 0.79 0.81 1.00 -0.81 0.95 -0.81 -0.81 0.79
gene3 -1.00 -0.81 1.00 -0.67 1.00 1.00 -0.36 -1.00 -0.81 1.00 -0.67 1.00 1.00 -0.36
gene4 0.67 0.95 -0.67 1.00 -0.67 -0.67 0.90 0.67 0.95 -0.67 1.00 -0.67 -0.67 0.90
gene5 -1.00 -0.81 1.00 -0.67 1.00 1.00 -0.36 -1.00 -0.81 1.00 -0.67 1.00 1.00 -0.36
gene6 -1.00 -0.81 1.00 -0.67 1.00 1.00 -0.36 -1.00 -0.81 1.00 -0.67 1.00 1.00 -0.36
gene7 0.36 0.79 -0.36 0.90 -0.36 -0.36 1.00 0.36 0.79 -0.36 0.90 -0.36 -0.36 1.00
gene1 1.00 0.81 -1.00 0.67 -1.00 -1.00 0.36 1.00 0.81 -1.00 0.67 -1.00 -1.00 0.36
gene2 0.81 1.00 -0.81 0.95 -0.81 -0.81 0.79 0.81 1.00 -0.81 0.95 -0.81 -0.81 0.79
gene3 -1.00 -0.81 1.00 -0.67 1.00 1.00 -0.36 -1.00 -0.81 1.00 -0.67 1.00 1.00 -0.36
gene4 0.67 0.95 -0.67 1.00 -0.67 -0.67 0.90 0.67 0.95 -0.67 1.00 -0.67 -0.67 0.90
gene5 -1.00 -0.81 1.00 -0.67 1.00 1.00 -0.36 -1.00 -0.81 1.00 -0.67 1.00 1.00 -0.36
gene6 -1.00 -0.81 1.00 -0.67 1.00 1.00 -0.36 -1.00 -0.81 1.00 -0.67 1.00 1.00 -0.36
gene7 0.36 0.79 -0.36 0.90 -0.36 -0.36 1.00 0.36 0.79 -0.36 0.90 -0.36 -0.36 1.00
As is seen there is a duplication in matrix, too. Why?
Upvotes: 2
Views: 4230
Reputation: 82
You can not find correlation between two vectors with different length, correlation need pairwise data to be calculated and it applies to all correlation methods.
Missing value estimation and time series models (such as GARCH) are not recommended since you are using biological data and pattern may vary between different patient and these methods are not able to consider all factors which may change the phenomena.
The best solution, I think, is to delete the extra data and make both samples having the same patient numbers. R has a built in function cor. The "use" argument in it helps you to ignore NA values.
Here is the link: http://www.statmethods.net/stats/correlations.html
Upvotes: 2
Reputation: 13304
Regarding the last portion of your question: rcorr
binds matrices sample1
and sample2
by columns and uses the combined matrix to calculate rank correlation coefficients. If you gave different names to the genes in sample1 and sample2, e.g.:
colnames(sample1) <- sprintf('sample1.%s',colnames(sample1))
colnames(sample2) <- sprintf('sample2.%s',colnames(sample2))
you would see that you have a block matrix with the diagonal blocks corresponding to the coefficients inside each sample (sample1
-sample1
and sample2
-sample2
), and the off-diagonal blocks--to the coefficients between sample1
and sample2
.
rcorr(sample1[1:8,],sample2[1:8,],type='s')
sample1.gene1 sample1.gene2 sample1.gene3 sample1.gene4 sample1.gene5 sample1.gene6 sample1.gene7 sample2.gene1 sample2.gene2 sample2.gene3 sample2.gene4
sample1.gene1 1.00 0.81 -1.00 0.67 -1.00 -1.00 0.36 1.00 0.81 -1.00 0.67
sample1.gene2 0.81 1.00 -0.81 0.95 -0.81 -0.81 0.79 0.81 1.00 -0.81 0.95
sample1.gene3 -1.00 -0.81 1.00 -0.67 1.00 1.00 -0.36 -1.00 -0.81 1.00 -0.67
sample1.gene4 0.67 0.95 -0.67 1.00 -0.67 -0.67 0.90 0.67 0.95 -0.67 1.00
sample1.gene5 -1.00 -0.81 1.00 -0.67 1.00 1.00 -0.36 -1.00 -0.81 1.00 -0.67
sample1.gene6 -1.00 -0.81 1.00 -0.67 1.00 1.00 -0.36 -1.00 -0.81 1.00 -0.67
sample1.gene7 0.36 0.79 -0.36 0.90 -0.36 -0.36 1.00 0.36 0.79 -0.36 0.90
sample2.gene1 1.00 0.81 -1.00 0.67 -1.00 -1.00 0.36 1.00 0.81 -1.00 0.67
sample2.gene2 0.81 1.00 -0.81 0.95 -0.81 -0.81 0.79 0.81 1.00 -0.81 0.95
sample2.gene3 -1.00 -0.81 1.00 -0.67 1.00 1.00 -0.36 -1.00 -0.81 1.00 -0.67
sample2.gene4 0.67 0.95 -0.67 1.00 -0.67 -0.67 0.90 0.67 0.95 -0.67 1.00
sample2.gene5 -1.00 -0.81 1.00 -0.67 1.00 1.00 -0.36 -1.00 -0.81 1.00 -0.67
sample2.gene6 -1.00 -0.81 1.00 -0.67 1.00 1.00 -0.36 -1.00 -0.81 1.00 -0.67
sample2.gene7 0.36 0.79 -0.36 0.90 -0.36 -0.36 1.00 0.36 0.79 -0.36 0.90
sample2.gene5 sample2.gene6 sample2.gene7
sample1.gene1 -1.00 -1.00 0.36
sample1.gene2 -0.81 -0.81 0.79
sample1.gene3 1.00 1.00 -0.36
sample1.gene4 -0.67 -0.67 0.90
sample1.gene5 1.00 1.00 -0.36
sample1.gene6 1.00 1.00 -0.36
sample1.gene7 -0.36 -0.36 1.00
sample2.gene1 -1.00 -1.00 0.36
sample2.gene2 -0.81 -0.81 0.79
sample2.gene3 1.00 1.00 -0.36
sample2.gene4 -0.67 -0.67 0.90
sample2.gene5 1.00 1.00 -0.36
sample2.gene6 1.00 1.00 -0.36
sample2.gene7 -0.36 -0.36 1.00
It happened that sample1
and sample2
are identical in your example, so that is why you have all block matrices equal.
UPDATE:
sample1-sample2 correlations can be calculated using cor
function:
library(reshape2)
# produce all combinations of column indices for sample1 and sample2
z <- expand.grid(s1=1:7,s2=1:7)
# due to the correlation matrix symmetry, we can calculate only an upper right trigonal matrix
z <- z[z$s2<z$s1,]
# calculate correlations
z$corr <- mapply(function(i,j) cor(sample1[1:8,i],sample2[1:8,j],method='spearman'),z$s1,z$s2)
# reshape the result into a trigonal matrix
corr.coefs <- dcast(z,s2~s1,value.var='corr')
Upvotes: 2