Kevin

Reputation: 8207

Fisher's Exact in scipy as new column using pandas

In an IPython notebook, I have a pandas DataFrame with four columns: numerator1, numerator2, denominator1 and denominator2.

Without iterating through each record, I am trying to create a fifth column titled FishersExact. I would like the value of the column to store the tuple returned by scipy.stats.fisher_exact using values (or some derivation of the values) from each of the four columns as my inputs.

df['FishersExact'] = scipy.stats.fisher_exact( [[df.numerator1, df.numerator2],
[df.denominator1 - df.numerator1 , df.denominator2 - df.numerator2]])

Returns:

/home/kevin/anaconda/lib/python2.7/site-packages/scipy/stats/stats.pyc in fisher_exact(table, alternative)
2544     c = np.asarray(table, dtype=np.int64)  # int32 is not enough for the algorithm
2545     if not c.shape == (2, 2):
-> 2546         raise ValueError("The input `table` must be of shape (2, 2).")
2547 
2548     if np.any(c < 0):

ValueError: The input `table` must be of shape (2, 2).

When I index only the first record of the dataframe:

odds,pval = scipy.stats.fisher_exact([[df.numerator1[0], df.numerator2[0]], 
[df.denominator1[0] - df.numerator1[0], df.denominator2[0] -df.numerator2[0]]])

This is returned:

1.1825710754 0.581151431104

I'm essentially trying to emulate column-wise arithmetic such as:

df['freqnum1denom1'] = df.numerator1 / df.denominator1

which adds a new column to the dataframe holding each record's frequency.

Probably missing something, any direction would be greatly appreciated, thank you!

Upvotes: 3

Views: 3147

Answers (1)

Ami Tavory

Reputation: 76297

It looks like you're building a matrix of pandas Series and passing it to the function in a single call. The function wants a 2x2 matrix of scalars, so it has to be called once per row. These two things are not the same.
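To see why the shape check fails, here is a minimal sketch with a made-up three-row dataframe (the sample values are assumptions, not the asker's data). Nesting whole columns produces a 3-D array, while indexing one row produces the 2x2 table that fisher_exact expects.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data using the column names from the question.
df = pd.DataFrame({
    "numerator1": [8, 3, 10],
    "numerator2": [2, 5, 10],
    "denominator1": [10, 12, 40],
    "denominator2": [12, 9, 35],
})

# Nesting whole columns: each "cell" is a Series of length 3.
whole_columns = np.asarray(
    [[df.numerator1, df.numerator2],
     [df.denominator1 - df.numerator1, df.denominator2 - df.numerator2]],
    dtype=np.int64,
)

# Nesting scalars from a single row: each cell is one number.
first_row = np.asarray(
    [[df.numerator1[0], df.numerator2[0]],
     [df.denominator1[0] - df.numerator1[0], df.denominator2[0] - df.numerator2[0]]],
    dtype=np.int64,
)

print(whole_columns.shape)  # one 2x2 table per row, stacked along a third axis
print(first_row.shape)      # a single 2x2 table
```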

There are (at least) two ways to go here.

Using apply

You could use pandas's apply for this.

df['FishersExact'] = df.apply(
    lambda r: scipy.stats.fisher_exact([[r.numerator1, ... ]]),
    axis=1)

Note the following:

  • axis=1 applies a function to each row.

  • Within the lambda, r.numerator1 is a scalar.
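For reference, a complete version of the apply approach might look like the sketch below. The sample data and the helper name fisher_row are assumptions for illustration; the table layout is the one from the question.

```python
import pandas as pd
from scipy.stats import fisher_exact

# Hypothetical sample data using the column names from the question.
df = pd.DataFrame({
    "numerator1": [8, 3, 10],
    "numerator2": [2, 5, 10],
    "denominator1": [10, 12, 40],
    "denominator2": [12, 9, 35],
})

def fisher_row(r):
    """Build the 2x2 table for one row and return (odds ratio, p-value)."""
    table = [
        [r.numerator1, r.numerator2],
        [r.denominator1 - r.numerator1, r.denominator2 - r.numerator2],
    ]
    odds, pval = fisher_exact(table)
    return (odds, pval)

# axis=1 passes each row to fisher_row; the result is a column of tuples.
df["FishersExact"] = df.apply(fisher_row, axis=1)
print(df)
```

This still calls fisher_exact once per row, so it is convenient rather than fast.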

Back To Basics

Fisher's exact test can be expressed as vectorized operations on the original columns, which should be much faster than calling the function row by row. To get the most speed, you would need a vectorized version of factorial (I don't know of one offhand). This could even be a separate (good!) SO question.
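One possible route, sketched here under assumptions: scipy.stats.hypergeom accepts arrays, so a one-sided p-value can be computed for every row at once without explicit factorials. This covers only the alternative="greater" case, not the two-sided default used in the question, and the sample data and the column name p_greater are made up for illustration.

```python
import pandas as pd
from scipy.stats import hypergeom

# Hypothetical sample data using the column names from the question.
df = pd.DataFrame({
    "numerator1": [8, 3, 10],
    "numerator2": [2, 5, 10],
    "denominator1": [10, 12, 40],
    "denominator2": [12, 9, 35],
})

# Cells of the 2x2 table [[a, b], [c, d]] as whole-column arrays.
a = df["numerator1"].to_numpy()
b = df["numerator2"].to_numpy()
c = (df["denominator1"] - df["numerator1"]).to_numpy()
d = (df["denominator2"] - df["numerator2"]).to_numpy()

# One-sided p-value: P(X >= a), where X is hypergeometric with
# population size a+b+c+d, a+b "success" items, and a+c draws.
df["p_greater"] = hypergeom.sf(a - 1, a + b + c + d, a + b, a + c)
print(df)
```

A two-sided p-value needs a sum over all tables at least as unlikely as the observed one, and the number of such tables differs per row, so it does not reduce to a single array call this simply.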

Upvotes: 2
