Reputation: 2655
I have a dataframe df of the form
         class_1_frequency  class_2_frequency
group_1                 20                 10
group_2                 60                 25
..
group_n                 50                 15
Suppose class_1 has a total of 70 members and class_2 has 30.
For each row (group_1, group_2, ..., group_n) I want to create a contingency table (preferably dynamically) and then carry out a chi-square test to obtain a p-value.
For example, for group_1, the contingency table under the hood would look like:
                 class_1  class_2
group_1_present       20       10
group_1_absent     70-20    30-10
Also, I know scipy.stats.chi2_contingency() is the appropriate function for a chi-square test, but I have not been able to apply it to my context. I have looked at previously discussed questions such as: here and here.
What is the most efficient way to achieve this?
Upvotes: 0
Views: 2775
Reputation: 2678
You can take advantage of the apply function on pd.DataFrame. It lets you apply an arbitrary function to the columns or rows of a DataFrame. Using your example:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame([[20, 10], [60, 25], [50, 15]])
To produce the contingency tables, one can use a lambda and some vector operations:
>>> members = np.array([70, 30])
>>> df.apply(lambda x: np.array([x, members-x]), axis=1)
0 [[20, 10], [50, 20]]
1 [[60, 25], [10, 5]]
2 [[50, 15], [20, 15]]
And this can of course be wrapped with the scipy function:
df.apply(lambda x: chi2_contingency(np.array([x, members-x])), axis=1)
This produces all of the return values, but by slicing the output you can keep only the ones you want, leaving out e.g. the array of expected frequencies. The resulting Series can also be converted to a DataFrame.
>>> s = df.apply(lambda x: chi2_contingency(np.array([x, members-x]))[:-1], axis=1)
>>> s
0 (0.056689342403628114, 0.8118072280034329, 1)
1 (0.0, 1.0, 1)
2 (3.349031920460492, 0.06724454934343391, 1)
dtype: object
>>> s.apply(pd.Series)
0 1 2
0 0.056689 0.811807 1.0
1 0.000000 1.000000 1.0
2 3.349032 0.067245 1.0
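Putting the steps above together, a minimal self-contained sketch might look like the following. The counts and class totals are the ones from the question; the index labels, the output column names and the helper name chi2_for_row are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Counts from the question; index and column labels are illustrative.
df = pd.DataFrame(
    [[20, 10], [60, 25], [50, 15]],
    index=["group_1", "group_2", "group_n"],
    columns=["class_1_frequency", "class_2_frequency"],
)
members = np.array([70, 30])  # total members of class_1 and class_2


def chi2_for_row(row):
    """Hypothetical helper: test one group's present/absent 2x2 table."""
    present = row.to_numpy()
    table = np.array([present, members - present])
    # chi2_contingency returns (statistic, p-value, dof, expected frequencies)
    chi2, p, dof, _expected = chi2_contingency(table)
    return pd.Series({"chi2": chi2, "p_value": p, "dof": dof})


# Returning a Series from the helper makes apply build a DataFrame directly.
results = df.apply(chi2_for_row, axis=1)
print(results)
```

Returning a named Series from the helper avoids the separate s.apply(pd.Series) step and gives the columns readable names.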
Now, I don't know about the execution efficiency of this approach, but I'd trust those who implemented these functions, and most likely speed is not that critical here. It is at least efficient in the sense that it is (hypothetically) easy to understand and fast to write.
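If speed ever does matter, one possible alternative is to compute the statistic for all groups at once with NumPy instead of calling the SciPy function per row. This is only a sketch under assumptions: it handles 2x2 tables only, it assumes every expected count is positive, and it reimplements the Yates continuity correction (shifting each observed count toward its expected value by at most 0.5), which is my understanding of what chi2_contingency does by default for one degree of freedom. The variable names are made up for illustration.

```python
import numpy as np
from scipy.stats import chi2

# One row per group: counts in class_1 and class_2 (values from the question).
present = np.array([[20, 10], [60, 25], [50, 15]], dtype=float)
members = np.array([70, 30], dtype=float)  # class totals
absent = members - present

# Shape (n_groups, 2, 2): rows are present/absent, columns are the classes.
tables = np.stack([present, absent], axis=1)

# Expected counts under independence: row total * column total / grand total.
n = tables.sum(axis=(1, 2), keepdims=True)
row_totals = tables.sum(axis=2, keepdims=True)
col_totals = tables.sum(axis=1, keepdims=True)
expected = row_totals * col_totals / n

# Assumed Yates correction: move observed toward expected by at most 0.5.
diff = expected - tables
corrected = tables + np.sign(diff) * np.minimum(0.5, np.abs(diff))

# Chi-square statistic per group and p-value with 1 degree of freedom.
stat = ((corrected - expected) ** 2 / expected).sum(axis=(1, 2))
p_values = chi2.sf(stat, df=1)
print(stat, p_values)
```

For the three example rows this should agree with the apply-based results shown above; it is worth checking the two against each other on real data before relying on it.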
Upvotes: 1