Reputation: 317
I have performed a hypergeometric analysis (using a python script) to investigate enrichment of GO-terms in a subset of genes. An example of my output is as follows:
GO00001 1500 300 200 150 5.39198144708e-77
GO00002 1500 500 400 350 1.18917839281e-160
GO00003 1500 400 350 320 9.48402847878e-209
GO00004 1500 100 100 75 3.82935778527e-82
GO00005 1500 100 80 80 2.67977253966e-114
where
Column1 = GO ID
Column2 = Total sum of all terms in the original dataset
Column3 = Total sum of [Column 1] IDs in the original dataset
Column4 = Sum of all terms in the subset
Column5 = Sum of [Column 1] IDs in subset
Column6 = pvalue derived from hypergeometric test
I know that I must multiply the number of experiments by the pvalue but I'm not sure how to do this with the data I have. Am I calculating from the subset or a combination of the original dataset and the subset? For example, would it be:
Column2 * Column5 * pvalue
Column3 * Column5 * pvalue
Column4 * Column5 * pvalue
I apologise if this seems like a stupid question but I just can't seem to get my head around it. Many thanks in advance!
Upvotes: 3
Views: 16412
Reputation: 23101
We can implement the Bonferroni correction for multiple testing on our own like the following
np.random.seed(123)
alpha = 0.05 # level of significance / type-I error rate
m = 100 # number of tests
raw_pvals = np.random.beta(1, 10, m) # some raw p-values, e.g., from hypergeometric analysis
significant = np.sum(raw_pvals < alpha)
significant
# 46
alpha_corrected = alpha / m
significant_bonferroni = np.sum(raw_pvals < alpha_corrected)
alpha_corrected
# 0.0005
significant_bonferroni
# 2
or we can use multipletests
from statsmodels.stats
:
from statsmodels.stats.multitest import multipletests
rejected, p_adjusted, _, alpha_corrected = multipletests(raw_pvals, alpha=alpha,
method='bonferroni', is_sorted=False, returnsorted=False)
np.sum(rejected)
# 2
alpha_corrected
# 0.0005
We can plot the distribution of raw vs adjusted p-values:
import seaborn as sns
sns.kdeplot(raw_pvals, color="red", shade=True, label='raw')
ax = sns.kdeplot(p_adjusted, color="green", shade=True, label='adujusted')
ax.set(xlim=(0, 1))
plt.title('distribution of p-values')
plt.legend()
Note that, as expected, Bonferroni is very conservative in the sense that it allowed rejection of only a couple of null hypothesis propositions.
Upvotes: 2
Reputation: 5976
from statsmodels.sandbox.stats.multicomp import multipletests
p_adjusted = multipletests(Column6, method='bonferroni')
Or am I missing something?..
Upvotes: 9