Reputation: 8207
In an IPython notebook, I have a pandas DataFrame with four columns: numerator1, numerator2, denominator1 and denominator2.
Without iterating through each record, I am trying to create a fifth column titled FishersExact. I would like the value of the column to store the tuple returned by scipy.stats.fisher_exact using values (or some derivation of the values) from each of the four columns as my inputs.
df['FishersExact'] = scipy.stats.fisher_exact( [[df.numerator1, df.numerator2],
[df.denominator1 - df.numerator1 , df.denominator2 - df.numerator2]])
Returns:
/home/kevin/anaconda/lib/python2.7/site-packages/scipy/stats/stats.pyc in fisher_exact(table, alternative)
2544 c = np.asarray(table, dtype=np.int64) # int32 is not enough for the algorithm
2545 if not c.shape == (2, 2):
-> 2546 raise ValueError("The input `table` must be of shape (2, 2).")
2547
2548 if np.any(c < 0):
ValueError: The input `table` must be of shape (2, 2).
When I index only the first record of the dataframe:
odds,pval = scipy.stats.fisher_exact([[df.numerator1[0], df.numerator2[0]],
[df.denominator1[0] - df.numerator1[0], df.denominator2[0] -df.numerator2[0]]])
This is returned:
1.1825710754 0.581151431104
I'm essentially trying to emulate the way column-wise arithmetic works, for example:
df['freqnum1denom1'] = df.numerator1 / df.denominator1
which adds a new column to the dataframe holding each record's frequency.
I'm probably missing something; any direction would be greatly appreciated. Thank you!
Upvotes: 3
Views: 3147
Reputation: 76297
It looks like you're building a matrix of pandas Series and passing it to the function. The function wants a 2x2 matrix of scalars, so it has to be called once per row. These two things are not the same.
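To make the difference concrete, here is a minimal sketch using a small made-up dataframe (the values are hypothetical; only the column names come from the question). It shows the shape NumPy sees in each case, which is what fisher_exact checks before raising the ValueError.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; only the column names follow the question.
df = pd.DataFrame({
    "numerator1": [8, 3, 5],
    "numerator2": [2, 4, 6],
    "denominator1": [9, 10, 12],
    "denominator2": [7, 11, 13],
})

# A nested list of whole columns becomes a 3-D array, not a 2x2 table.
table_of_series = np.asarray([
    [df.numerator1, df.numerator2],
    [df.denominator1 - df.numerator1, df.denominator2 - df.numerator2],
])
print(table_of_series.shape)  # (2, 2, 3): the last axis runs over the rows of df

# Indexing a single record yields scalars, so the table is 2x2 as required.
table_of_scalars = np.asarray([
    [df.numerator1[0], df.numerator2[0]],
    [df.denominator1[0] - df.numerator1[0], df.denominator2[0] - df.numerator2[0]],
])
print(table_of_scalars.shape)  # (2, 2)
```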
There are (at least) two ways to go here.
Using apply
You could use pandas' apply for this.
df['FishersExact'] = df.apply(
lambda r: scipy.stats.fisher_exact([[r.numerator1, ... ]]),
axis=1)
Note the following:
axis=1 applies the function to each row.
Within the lambda, r.numerator1 is a scalar.
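For reference, a complete version of the apply approach might look like the sketch below. The dataframe values are made up for illustration, and fisher_row is a hypothetical helper name; the 2x2 table is built the same way as in the question.

```python
import pandas as pd
from scipy.stats import fisher_exact

# Hypothetical example data; only the column names follow the question.
df = pd.DataFrame({
    "numerator1": [8, 3, 5],
    "numerator2": [2, 4, 6],
    "denominator1": [9, 10, 12],
    "denominator2": [7, 11, 13],
})

def fisher_row(r):
    """Run Fisher's exact test on the 2x2 table built from one row."""
    table = [
        [r.numerator1, r.numerator2],
        [r.denominator1 - r.numerator1, r.denominator2 - r.numerator2],
    ]
    odds, pval = fisher_exact(table)  # default two-sided alternative
    return (odds, pval)               # plain tuple, stored as one cell value

# axis=1 passes each row to fisher_row as a Series of scalars.
df["FishersExact"] = df.apply(fisher_row, axis=1)
print(df)
```

This still calls fisher_exact once per record internally, so it is convenient rather than fast.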
Back To Basics
Fisher's exact test can be expressed as vectorized operations on the original columns, which should be much faster than calling the function once per row. To get the most speed out of it, you would need a vectorized version of factorial (I don't know of one offhand). This could even be a separate (good!) SO question.
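As a rough sketch of what a vectorized version could look like, the one-sided ("greater") p-value is a hypergeometric tail probability, and scipy.stats.hypergeom accepts whole arrays. This is an assumption-laden illustration, not a drop-in replacement: it covers only the one-sided alternative, not the two-sided default used in the question, and the data below is made up.

```python
import numpy as np
import pandas as pd
from scipy.stats import fisher_exact, hypergeom

# Hypothetical example data; only the column names follow the question.
df = pd.DataFrame({
    "numerator1": [8, 3, 5],
    "numerator2": [2, 4, 6],
    "denominator1": [9, 10, 12],
    "denominator2": [7, 11, 13],
})

# Cells of the 2x2 table [[a, b], [c, d]], one array element per record.
a = df.numerator1.to_numpy()
b = df.numerator2.to_numpy()
c = (df.denominator1 - df.numerator1).to_numpy()
d = (df.denominator2 - df.numerator2).to_numpy()

# Sample odds ratio for every record at once (assumes b * c is never zero).
odds = (a * d) / (b * c)

# One-sided p-value P(X >= a), where X is hypergeometric with
# population a+b+c+d, a+b marked items and a+c draws.
p_greater = hypergeom.sf(a - 1, a + b + c + d, a + b, a + c)

# Spot check against the row-by-row function for the first record.
check = fisher_exact([[a[0], b[0]], [c[0], d[0]]], alternative="greater")
print(p_greater[0], check[1])
```

The two-sided p-value needs an additional sum over all tables that are no more likely than the observed one, which is harder to vectorize and is left out here.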
Upvotes: 2