Melsauce

Reputation: 2655

How to perform chi-square tests on rows of pandas DataFrames?

I have a dataframe df of the form

          class_1_frequency    class_2_frequency
group_1          20                    10
group_2          60                    25 
..
group_n          50                    15 

Suppose class_1 has a total of 70 members and class_2 has 30.

For each row (group_1, group_2, ..., group_n) I want to create a contingency table (preferably dynamically) and then carry out a chi-square test to evaluate p-values.

For example, for group_1, the contingency table under the hood would look like:

                   class_1      class_2
group_1_present      20           10
group_1_absent     70-20         30-10
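In code, I imagine the table for group_1 would be built along these lines (just a sketch, with the class totals hard-coded):

import numpy as np

totals = np.array([70, 30])            # class_1 and class_2 totals
present = np.array([20, 10])           # group_1 counts per class
table = np.array([present, totals - present])
# array([[20, 10],
#        [50, 20]])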

Also, I know scipy.stats.chi2_contingency() is the appropriate function for the chi-square test, but I am not able to apply it in my context. I have looked at previously discussed questions such as: here and here.

What is the most efficient way to achieve this?

Upvotes: 0

Views: 2775

Answers (1)

Felix

Reputation: 2678

You can take advantage of the apply method of pd.DataFrame. It allows you to apply an arbitrary function to the columns or rows of a DataFrame. Using your example:

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame([[20, 10], [60, 25], [50, 15]])

To produce the contingency tables, one can use a lambda and some vector operations:

>>> members = np.array([70, 30])
>>> df.apply(lambda x: np.array([x, members-x]), axis=1)
0    [[20, 10], [50, 20]]
1    [[60, 25], [10,  5]]
2    [[50, 15], [20, 15]]
dtype: object

This can of course be wrapped with the scipy function:

df.apply(lambda x: chi2_contingency(np.array([x, members-x])), axis=1)

This produces all of the return values, but by slicing the output you can keep just the ones you want, dropping e.g. the expected frequency arrays. The resulting Series can also be converted to a DataFrame:

>>> s = df.apply(lambda x: chi2_contingency(np.array([x, members-x]))[:-1], axis=1)
>>> s
0    (0.056689342403628114, 0.8118072280034329, 1)
1                                    (0.0, 1.0, 1)
2      (3.349031920460492, 0.06724454934343391, 1)
dtype: object
>>> s.apply(pd.Series)
          0         1    2
0  0.056689  0.811807  1.0
1  0.000000  1.000000  1.0
2  3.349032  0.067245  1.0
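The three columns are the test statistic, the p-value, and the degrees of freedom, in the order returned by chi2_contingency. For readability you can name them (result is just a hypothetical variable name here):

result = s.apply(pd.Series)
result.columns = ['chi2', 'p_value', 'dof']  # order returned by chi2_contingency
result.index = df.index                      # keep your group labels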

I can't speak to the execution efficiency of this approach, but I'd trust the people who implemented these functions, and most likely speed is not that critical here. It is at least efficient in the sense that it is easy to understand and fast to write.
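If speed ever does become an issue, the 2x2 case can also be vectorized by hand. The sketch below reuses df and members from above and implements the textbook Yates-corrected chi-square formula (chi2_contingency also applies Yates' correction to 2x2 tables by default), so treat it as an approximation of the loop above, not a guaranteed drop-in replacement:

from scipy.stats import chi2

# Reusing df and members from above.
a = df[0].to_numpy()    # class_1 counts per group
b = df[1].to_numpy()    # class_2 counts per group
c = members[0] - a      # class_1 "absent" counts
d = members[1] - b      # class_2 "absent" counts
n = a + b + c + d       # grand total of each 2x2 table

# Yates-corrected statistic:
# n * max(|ad - bc| - n/2, 0)^2 / ((a+b)(c+d)(a+c)(b+d))
stat = n * np.maximum(np.abs(a * d - b * c) - n / 2, 0) ** 2 \
       / ((a + b) * (c + d) * (a + c) * (b + d))
p = chi2.sf(stat, df=1)  # upper-tail p-values, 1 degree of freedom

# stat -> array([0.05668934, 0.        , 3.34903192])
# p    -> array([0.81180723, 1.        , 0.06724455])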

Upvotes: 1
