arkadiy

Reputation: 766

Automating the process of identifying subgroups of a pandas dataframe that do not significantly differ on a value

I have the following dataframe, which, for the sake of this example, is full of random numbers:

import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

df = pd.DataFrame(np.random.randint(0,1000,size=(100, 4)), columns=list('ABCD'))
df['Category'] = np.random.randint(1, 3, df.shape[0])
df.head()

     A    B    C    D  Category
0  417   88  924  844         2
1  647  136   57  680         2
2  223  883  837   56         2
3  346   94   19   80         1
4  635  863  405   29         1

I need to find a subset of n rows (say, 80 rows) in which the value of "C" does not differ significantly (p > .05) between the two category groups (that is, between categories 1 and 2).

I perform the following t-test to test if the difference is significant:

# t-test
cat1 = df[df['Category']==1]
cat2 = df[df['Category']==2]

ttest_ind(cat1['C'], cat2['D'])

Output:

Ttest_indResult(statistic=-2.004339328381308, pvalue=0.047793084338372295)

Currently, I do this by trial and error: I pick a subset by hand, test it, and repeat until I get the desired result. Is there a way to automate this process?

Upvotes: 1

Views: 133

Answers (1)

Raphaele Adjerad

Reputation: 1125

Here is my suggestion: use combinations from itertools, as rightly suggested by @rpanai, together with groupby and pipe, which lets you pull different groups within the same operation. The lambda returns a Boolean indicating whether the p-value is above the 0.05 threshold, and the loop breaks as soon as that Boolean is True:

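from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

np.random.seed(123)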
df = pd.DataFrame(np.random.randint(0,1000,size=(100, 4)), columns=list('ABCD'))
df['Category'] = np.random.randint(1, 3, df.shape[0])
df.head()

list_iter = [idx for idx in combinations(df.Category.unique(), 2)]
test = dict()

for i, j in list_iter:
    test[(i, j)] = df.groupby("Category").pipe(lambda g: ttest_ind(g["C"].get_group(i), 
                                               g["C"].get_group(j))[1] > 0.05)
    if test[(i, j)]:
        break

Here, in this example, the dictionary test is:

{(2, 1): True}

It works with any number of groups. For instance, if Category has three groups, with df['Category'] = np.random.randint(1, 4, df.shape[0]), the output for test would look like:

{(2, 3): True}
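
If you would rather see the p-value for every pair instead of stopping at the first pair that passes, the same idea can be sketched as below. This is only an illustration: the helper name pairwise_pvalues, its parameters, and the frame df3 are made up for this example, and it uses a plain dictionary of groups instead of pipe.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import ttest_ind


def pairwise_pvalues(frame, value_col="C", group_col="Category"):
    """Return {(group_a, group_b): p-value} for every pair of groups."""
    # Hypothetical helper: split the value column once per group.
    groups = {key: grp[value_col] for key, grp in frame.groupby(group_col)}
    return {
        (a, b): ttest_ind(groups[a], groups[b]).pvalue
        for a, b in combinations(sorted(groups), 2)
    }


# Example data with three categories (1, 2, 3); the values are random.
rng = np.random.default_rng(123)
df3 = pd.DataFrame(rng.integers(0, 1000, size=(100, 4)), columns=list("ABCD"))
df3["Category"] = rng.integers(1, 4, size=len(df3))

pvalues = pairwise_pvalues(df3)
# Keep only the pairs whose difference on C is not significant.
not_significant = {pair: p for pair, p in pvalues.items() if p > 0.05}
```

Keeping the raw p-values lets you apply the threshold afterwards, or change it, without rerunning the tests.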

EDIT: If you want the values of A for a successful test, you can do the following:

list_iter = [idx for idx in combinations(df.Category.unique(), 2)]
test = dict()
for i, j in list_iter:
    test[(i, j)] = df.groupby("Category").pipe(lambda g: ttest_ind(g["C"].get_group(i),
                                                        g["C"].get_group(j))[1] > 0.05)
    if test[(i, j)]:
        output = df.loc[df["Category"].isin([i,j]), ["Category", "A"]]
        break

I replaced D with C because, on re-reading your question, you say you want to compare C across different values of Category. If it is not C but C and D, combinations won't iterate across all the groups you want. I also changed the Boolean to test for a p-value above 0.05, since you want the groups that are not significantly different.

Here I have the following result for test:

{(2, 3): True}

and for output:

   Category    A
0         2  510
1         3  988
2         2  595

You get the values of A for the two categories 2 and 3 where values of C were not significantly different.
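
For the subset part of the question (n rows, say 80, on which the two categories do not differ significantly on C), one way to automate the trial and error is to draw random subsets until the test passes. The sketch below assumes random sampling without replacement is acceptable; the function find_subset and its parameters (alpha, max_tries, seed) are hypothetical names for this example, not part of pandas or scipy.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind


def find_subset(frame, n_rows, value_col="C", group_col="Category",
                alpha=0.05, max_tries=1000, seed=0):
    """Draw random n_rows subsets until the two-group t-test gives p > alpha.

    Returns (subset, p_value), or (None, None) if no draw succeeds.
    """
    rng = np.random.default_rng(seed)
    index = frame.index.to_numpy()
    for _ in range(max_tries):
        # Pick n_rows distinct row labels at random.
        chosen = rng.choice(index, size=n_rows, replace=False)
        subset = frame.loc[chosen]
        samples = [grp[value_col] for _, grp in subset.groupby(group_col)]
        if len(samples) != 2 or min(len(s) for s in samples) < 2:
            continue  # need exactly two groups with at least two rows each
        p_value = ttest_ind(samples[0], samples[1]).pvalue
        if p_value > alpha:
            return subset, p_value
    return None, None


# Example data shaped like the frame in the question; the values are random.
rng = np.random.default_rng(123)
df = pd.DataFrame(rng.integers(0, 1000, size=(100, 4)), columns=list("ABCD"))
df["Category"] = rng.integers(1, 3, size=len(df))

subset, p_value = find_subset(df, n_rows=80)
```

If no draw passes within max_tries, the function returns (None, None), so check for that before using subset.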

Upvotes: 2
