Reputation: 766
I have the following dataframe, which, for the sake of this example, is full of random numbers:
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind
df = pd.DataFrame(np.random.randint(0,1000,size=(100, 4)), columns=list('ABCD'))
df['Category'] = np.random.randint(1, 3, df.shape[0])
df.head()
A B C D Category
0 417 88 924 844 2
1 647 136 57 680 2
2 223 883 837 56 2
3 346 94 19 80 1
4 635 863 405 29 1
I need to find a subset of n rows (say, 80 rows) in which the value of "C" does not differ significantly (p > .05) between the two category groups (that is, between categories 1 and 2).
I perform the following t-test to test if the difference is significant:
# t-test
cat1 = df[df['Category']==1]
cat2 = df[df['Category']==2]
ttest_ind(cat1['C'], cat2['D'])
Output:
Ttest_indResult(statistic=-2.004339328381308, pvalue=0.047793084338372295)
Currently, I am doing this manually, using trial and error. I do this by manually picking the subsets, testing them, and then retesting until I find the desired result. I am curious to hear if there is a way to automate this process.
Upvotes: 1
Views: 133
Reputation: 1125
Here is my suggestion: use combinations from itertools, as rightfully suggested by @rpanai, together with groupby and pipe, which lets you access the different groups within the same operation. For each pair of categories you store a Boolean telling whether the p-value is above the 0.05 threshold, and you break out of the loop as soon as that Boolean is True:
from itertools import combinations

np.random.seed(123)
df = pd.DataFrame(np.random.randint(0,1000,size=(100, 4)), columns=list('ABCD'))
df['Category'] = np.random.randint(1, 3, df.shape[0])
df.head()
# every unordered pair of category labels
list_iter = [idx for idx in combinations(df.Category.unique(), 2)]
test = dict()
for i, j in list_iter:
    # True when column C does not differ significantly between groups i and j
    test[(i, j)] = df.groupby("Category").pipe(
        lambda g: ttest_ind(g["C"].get_group(i), g["C"].get_group(j))[1] > 0.05)
    if test[(i, j)]:
        break
Here, in this example, the dictionary test is:
{(2, 1): True}
It works with any number of groups. For instance, if Category has three groups, with df['Category'] = np.random.randint(1, 4, df.shape[0]), the output for test would look like:
{(2, 3): True}
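If you would rather see the p-value for every pair of categories instead of stopping at the first non-significant one, a minimal sketch could look like the following. The names groups, pvalues and not_significant are only illustrative, and the data is regenerated with its own seed, so the numbers will not match the output shown above.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Illustrative data with three categories (assumption: same shape as in the question)
np.random.seed(123)
df = pd.DataFrame(np.random.randint(0, 1000, size=(100, 4)), columns=list("ABCD"))
df["Category"] = np.random.randint(1, 4, df.shape[0])

# Split column C by category once, then test every unordered pair of categories
groups = {cat: grp["C"] for cat, grp in df.groupby("Category")}
pvalues = {
    (i, j): ttest_ind(groups[i], groups[j]).pvalue
    for i, j in combinations(sorted(groups), 2)
}

# Keep only the pairs whose difference on C is not significant at the 0.05 level
not_significant = {pair: p for pair, p in pvalues.items() if p > 0.05}
print(pvalues)
print(not_significant)
```

Keeping the raw p-values makes it easy to change the threshold later without rerunning the tests.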
EDIT: If you want the values of A for a successful test, you can do the following:
list_iter = [idx for idx in combinations(df.Category.unique(), 2)]
test = dict()
for i, j in list_iter:
    test[(i, j)] = df.groupby("Category").pipe(
        lambda g: ttest_ind(g["C"].get_group(i), g["C"].get_group(j))[1] > 0.05)
    if test[(i, j)]:
        # keep Category and A for the rows belonging to the two matching groups
        output = df.loc[df["Category"].isin([i,j]), ["Category", "A"]]
        break
I replaced D with C because, re-reading your question, you say you want to compare C across different values of Category. If it is not C alone but C and D that you want to compare, combinations won't iterate across all the groups you want.
I also changed the Boolean to being above 0.05, since you want the groups that are not significantly different.
Here I have the following result for test:
{(2, 3): True}
and for output:
   Category    A
0         2  510
1         3  988
2         2  595
You get the values of A for the two categories 2 and 3 where values of C were not significantly different.
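Coming back to the part of the question about picking n rows automatically: one possible way to replace the manual trial and error is a plain random search that draws n rows, runs the t-test on C between the two categories, and stops at the first draw with p > 0.05. This is only a sketch under assumptions. The function name find_balanced_subset and its parameters max_tries and seed are hypothetical, it expects exactly two categories, and it is a random search, so it may return None if no suitable sample turns up within max_tries.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind


def find_balanced_subset(df, n, column="C", group="Category",
                         alpha=0.05, max_tries=1000, seed=0):
    """Return the first random n-row sample whose two groups pass the t-test, else None."""
    labels = sorted(df[group].unique())
    if len(labels) != 2:
        raise ValueError("this sketch expects exactly two categories")
    rng = np.random.RandomState(seed)  # shared state, so every draw is different
    for _ in range(max_tries):
        sample = df.sample(n=n, random_state=rng)
        a = sample.loc[sample[group] == labels[0], column]
        b = sample.loc[sample[group] == labels[1], column]
        if len(a) < 2 or len(b) < 2:
            continue  # too few rows in one group for a meaningful test
        if ttest_ind(a, b).pvalue > alpha:
            return sample
    return None


# Illustrative data shaped like the question's dataframe
np.random.seed(123)
df = pd.DataFrame(np.random.randint(0, 1000, size=(100, 4)), columns=list("ABCD"))
df["Category"] = np.random.randint(1, 3, df.shape[0])

subset = find_balanced_subset(df, n=80)
print(None if subset is None else subset.shape)
```

The returned sample keeps the original index, so you can see which rows were selected with subset.index.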
Upvotes: 2