Reputation: 1656

Using the hypergeometric test in python

I have two gene lists and I calculate the intersection between them.
I need to calculate the p value for the hypothesize that - the intersection of these lists occured by chance.
I tried to implement that using fisher's exact test (scipy function).
Notice that I need a one sided p value.

My code:

def main(gene_path1, gene_path2, pop_size):
    genes1 = pd.read_csv(gene_path1, sep='\n', header=None)
    genes2 = pd.read_csv(gene_path2, sep='\n', header=None)

    intersection = pd.merge(genes1, genes2, how='inner').drop_duplicates([0])

    len_genes1 = genes1[0].count()
    len_genes2 = genes2[0].count()
    len_intersection = intersection[0].count()

    oddsratio, pvalue = stats.fisher_exact([[len_genes1 - len_intersection, len_genes1], [len_genes2 - len_intersection, len_genes2]], alternative='less')

    print(f'Genes1 len: {len_genes1}, Genes2 len: {len_genes2}, Intersection: {len_intersection}, pvalue: {pvalue}')

For the sake of simplicity, I used a list of numbers (not genes).

Since it's too long I won't copy the entire file but imagine two files with lots of random numbers seperated by a newline.
For example:

The question is - how can I be sure that I specified the arguments for the fisher's exact function correctly? is it right according to the hypothesize i am trying to check?
I suspect that I have done it incorrectly but i'm not sure why. might be a hint for what's wrong - I understand that the population size should be relevant but I am not sure where to use it and how.
Any leads or insights would be appreciated.

UPDATE:
I tried to implement it in a different way.

from scipy.stats import hypergeom as hg
import pandas as pd
def main(gene_path1, gene_path2, pop_size):
    genes1 = pd.read_csv(gene_path1, sep='\n', header=None)
    genes2 = pd.read_csv(gene_path2, sep='\n', header=None)

    intersection = pd.merge(genes1, genes2, how='inner').drop_duplicates([0])

    len_genes1 = genes1[0].count()
    len_genes2 = genes2[0].count()
    len_intersection = intersection[0].count()
    pvalue = hg.cdf(int(len_intersection)-1, int(pop_size), int(len_genes1), int(len_genes2))
    print(f'Genes1 len: {len_genes1}, Genes2 len: {len_genes2}, Intersection: {len_intersection}, p value: {pvalue})

I am just wondering if I got the arguments in the right place, how could I validate that?

Upvotes: 1

Answers (2)

O.rka

Reputation: 30677

This should help too: http://pedagogix-tagc.univ-mrs.fr/courses/ASG1/practicals/go_statistics_td/go_statistics_td_2015.html

g = 75 ## Number of submitted genes
k = 59 ## Size of the selection, i.e. submitted genes with at least one annotation in GO biological processes
m = 611 ## Number of "marked" elements, i.e. genes associated to this biological process
N = 13588 ## Total number of genes with some annotation in GOTERM_BP_FAT.
n = N - m ## Number of "non-marked" elements, i.e. genes not associated to this biological process
x = 19 ## Number of "marked" elements in the selection, i.e. genes of the group of interest that are associated to this biological process

# Python
stats.hypergeom(M=N, 
                n=m, 
                N=k).sf(x-1)
# 4.989682834451419e-12

# R
phyper(q=x -1, m=m, n=n, k=k, lower.tail=FALSE)
# [1] 4.989683e-12

Upvotes: 5

Parisa Daj

Reputation: 1

I wonder if you still have the same issue or not. However, I found this link pretty useful to make sure of your hypergeometric test results. Regarding your calculations, your results has to be equal to Cumulative Probability: P(X < int(len_intersection))

Upvotes: 0

Using the hypergeometric test in python

Answers (2)

Related Questions