Reputation: 1
I would like to compare different vectors (1 per subject) between two groups. What I want to do is similar to the work performed in this paper.
https://www.pnas.org/content/early/2020/08/14/2003181117 Figure 2B.
So, I already have an array of normalized vectors for each group, for example:
X = array([[0.8081178 , 0.1618492 , 1. , 0. , 0.52503616],
[0.9155495 , 0.9229482 , 0.55023754, 0. , 1. ],
[0.5497678 , 1. , 0.5295068 , 0. , 0.9580641 ],
[0.8554752 , 0. , 1. , 0.27967405, 0.43231127],
[0.8771384 , 0.15983552, 1. , 0.24160399, 0. ],
[1. , 0. , 0.34030336, 0.8518671 , 0.14370875],
[0.96829957, 0.89825296, 0.9989327 , 0. , 1. ],
[0.19713035, 1. , 0.8313886 , 0. , 0.69545555],
[1. , 0. , 0.15145707, 0.62412727, 0.19574052],
[1. , 0. , 0.6768882 , 0.3267132 , 0.53155863],
[0. , 0.11568664, 1. , 0.06043369, 0.2405336 ],
[1. , 0.7901962 , 0.55479664, 0. , 0.21075204],
[0.8389194 , 0.9723087 , 0.9122212 , 0. , 1. ],
[1. , 0. , 0.74783736, 0.27481842, 0.54764044],
[0.7932238 , 0.78063756, 1. , 0. , 0.76313186],
[0. , 0.28478605, 1. , 0.48485696, 0.5902692 ]])
Y = array([[1. , 0.8730191 , 0.72493815, 0. , 0.9373017 ],
[1. , 0.8563728 , 0.71862656, 0. , 0.74088454],
[0.878855 , 0.8799178 , 1. , 0. , 0.8985272 ],
[0.94998175, 0.924029 , 0.74815565, 0. , 1. ],
[1. , 0.4086177 , 0.3750266 , 0. , 0.87822354],
[0.85906726, 1. , 0.37570593, 0. , 0.9324212 ],
[0.8055762 , 1. , 0.85996395, 0. , 0.9541106 ],
[0.96801126, 1. , 0.72156 , 0. , 0.8689768 ],
[1. , 0.9446373 , 0.5445604 , 0. , 0.56854314],
[0.86714363, 1. , 0.6032697 , 0. , 0.7075365 ],
[1. , 0.8875634 , 0.8770225 , 0. , 0.8542803 ],
[1. , 0.93619907, 0.8262237 , 0. , 0.87035996],
[1. , 0.8533749 , 0.8739984 , 0. , 0.97969407],
[1. , 0.63581806, 0.7951289 , 0. , 0.88310444],
[0.82491845, 1. , 0.6478972 , 0. , 0.8846024 ],
[1. , 0.79563105, 0.55089736, 0. , 0.90971696]])
I would like to perform a permutation test on the spatial distance (cosine similarity) between the average group vectors. The purpose is to identify whether the vectors of the two groups (X, Y) can be considered different or not. I already know how to calculate the spatial distance, e.g.:
import numpy as np
from scipy import spatial
# cosine distance between the two group-average vectors
cosine_dist = spatial.distance.cosine(np.mean(X, axis=0), np.mean(Y, axis=0))
However, what they did in the paper is: first, randomly divide the pooled vectors into two groups; second, calculate the spatial distance between the group averages; third, test whether the observed cosine value is different from random (with a permutation test).
I don't know how to integrate that into sklearn.model_selection.permutation_test_score, or whether that is even the right permutation test for this.
Also, I found http://rasbt.github.io/mlxtend/user_guide/evaluate/permutation_test/, but in their function X and Y can't have different shapes...
I may have a solution based on: https://stats.stackexchange.com/questions/330540/how-to-interpret-very-low-similarity-score-of-two-vectors-but-having-significant
import math
import numpy as np
from scipy import stats

# cosine similarity between two 1-D vectors
similarity = lambda x1, x2: sum(xj*xk for xj, xk in zip(x1, x2)) / math.sqrt(sum(xj**2 for xj in x1) * sum(xk**2 for xk in x2))

x1 = np.mean(X, axis=0)
x2 = np.mean(Y, axis=0)
s = similarity(x1, x2)  # observed between-group similarity

## permutation test
sr = []
for j in range(10000):
    concat_arrays = np.concatenate((X, Y), axis=0)
    np.random.shuffle(concat_arrays)
    # put the number of individuals (macaque, lemur or human) here
    split = np.split(concat_arrays, [len(X)])
    sr.append(similarity(np.mean(split[0], axis=0), np.mean(split[1], axis=0)))

## -log10(p) from a Weibull fit of the permuted similarities
shape, loc, scale = stats.weibull_min.fit(sr)
ej = ((s - loc) / scale)**shape * math.log10(math.exp(1.))
p = 10**(-ej)
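If I understand the last part correctly, it uses the Weibull survival function P(S > s) = exp(-((s - loc)/scale)^shape), so ej is just -log10 of that probability. As a sanity check (my own assumption, not something taken from the paper), I could probably also compare s directly against the raw permutation distribution instead of extrapolating with the Weibull fit:
# empirical p-value: fraction of random splits whose between-half similarity
# is as small as (or smaller than) the observed between-group similarity;
# a small value would mean X and Y are less similar to each other than random halves
sr_arr = np.asarray(sr)
p_emp = (np.sum(sr_arr <= s) + 1) / (len(sr_arr) + 1)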
What do you think about this proposition? For len(X), I don't know whether I am supposed to use the length of one of my original group arrays or something else?
Upvotes: 0
Views: 602
Reputation: 86
The cosine similarity computation proposed by Shaeffer et al. seems to be based on bootstrapping many cosine similarity measurements. In that sense, I think the two groups are stacked and then divided in half. The bootstrapping smooths out the random division of all individual fingerprints.
I didn't test your code, but I don't see any major issue with it.
Your len(X) should then be equal to half of the stacked size of all individual fingerprints. If that total is odd, either ignore one fingerprint or duplicate one in both groups.
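Here is a minimal sketch of how that half split could look (the helper name, the number of rounds and the fixed seed are just my assumptions, not something taken from the paper):
import numpy as np
from scipy import spatial

def group_cosine_similarity(a, b):
    # cosine similarity (1 - cosine distance) between the two group means
    return 1.0 - spatial.distance.cosine(np.mean(a, axis=0), np.mean(b, axis=0))

rng = np.random.default_rng(0)
pooled = np.concatenate((X, Y), axis=0)
half = pooled.shape[0] // 2            # half of all stacked fingerprints
s_obs = group_cosine_similarity(X, Y)  # observed between-group similarity

sr = []
for _ in range(10000):
    perm = rng.permutation(pooled)     # random re-assignment of fingerprints
    sr.append(group_cosine_similarity(perm[:half], perm[half:]))
From there you can either fit the Weibull distribution as in your code or simply count how many of the permuted similarities are as extreme as s_obs.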
Upvotes: 0