Asim

Reputation: 1480

pandas check if two values are statistically different

I have a pandas dataframe which has some values for Male and some for Female. I would like to determine whether the two genders' percentages of the total are significantly different, and to report confidence intervals for these rates. The sample code is given below:

import pandas as pd

# make a simple data frame
data={}
data['gender']=['male','female','female','male','female','female','male','female','male']
data['values']=[10,2,13,4,11,8,14,19,2]
df_new=pd.DataFrame(data)
df_new.head()


    gender  values
0   male    10
1   female  2
2   female  13
3   male    4
4   female  11

df_male=df_new.loc[df_new['gender']=='male']
df_female=df_new.loc[df_new['gender']=='female']   # separate male and female

# calculate percentages
male_percentage=sum(df_male['values'].values)*100/sum(df_new['values'].values)
female_percentage=sum(df_female['values'].values)*100/sum(df_new['values'].values)

# want to tell whether the two percentages are statistically different and what their confidence intervals are
print(male_percentage)
print(female_percentage)

Any help will be much appreciated. Thanks!

Upvotes: 4

Views: 4979

Answers (2)

wwnde

Reputation: 26676

Use a t-test. In this case, use a two-sample t-test, which compares the means of two independent samples.

My alternative hypothesis is A != B, which I assess by testing the null hypothesis A = B. This is done by calculating a p-value. When p falls below a chosen threshold, called alpha, I reject the null hypothesis. The conventional value for alpha is 0.05: if the null hypothesis were true, a difference at least as large as the one observed would occur less than 5% of the time.

Extract the samples, in this case a list of values for each gender:

A=df_new[df_new['gender']=='male']['values'].values.tolist()
B=df_new[df_new['gender']=='female']['values'].values.tolist()

Using the scipy library, run the t-test:

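from scipy import stats

t_check=stats.ttest_ind(A,B)   # two-sample t-test; result holds (statistic, pvalue)
print(t_check)
alpha=0.05
if t_check.pvalue<alpha:
    print('A different from B')
else:
    print('No significant difference between A and B')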
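The question also asks for a confidence interval, which the test above does not report. The following is a minimal sketch, not part of the original answer, that runs Welch's variant of the test (`equal_var=False`, which does not assume equal variances) and computes a 95% confidence interval for the difference in means by hand. The choice of Welch's test and the names `diff`, `dof`, `ci_low` and `ci_high` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Sample data from the question
df_new = pd.DataFrame({
    'gender': ['male', 'female', 'female', 'male', 'female',
               'female', 'male', 'female', 'male'],
    'values': [10, 2, 13, 4, 11, 8, 14, 19, 2],
})

A = df_new.loc[df_new['gender'] == 'male', 'values'].to_numpy(dtype=float)
B = df_new.loc[df_new['gender'] == 'female', 'values'].to_numpy(dtype=float)

# Welch's t-test: two-sample t-test without the equal-variance assumption
result = stats.ttest_ind(A, B, equal_var=False)

# Squared standard error contributed by each sample
var_a = A.var(ddof=1) / len(A)
var_b = B.var(ddof=1) / len(B)
se = np.sqrt(var_a + var_b)

# Welch-Satterthwaite approximation of the degrees of freedom
dof = (var_a + var_b) ** 2 / (
    var_a ** 2 / (len(A) - 1) + var_b ** 2 / (len(B) - 1)
)

# 95% confidence interval for mean(A) - mean(B)
diff = A.mean() - B.mean()
margin = stats.t.ppf(0.975, dof) * se
ci_low, ci_high = diff - margin, diff + margin

print('p-value:', result.pvalue)
print('difference in means:', diff)
print('95% CI:', (ci_low, ci_high))
```

If the interval contains 0, the test does not reject the null hypothesis at alpha = 0.05. With only 4 and 5 observations per group, the interval will be wide.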

Upvotes: 5

Ukrainian-serge

Reputation: 854

Try this:

df_new.groupby('gender')['values'].sum()/df_new['values'].sum()*100

gender
female    63.855422
male      36.144578
Name: values, dtype: float64
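These shares are point estimates only. One possible way to attach a confidence interval to them, not taken from this answer, is a percentile bootstrap: resample the rows with replacement, recompute the male share each time, and read off the 2.5th and 97.5th percentiles. The sketch below assumes this approach; the seed, the number of resamples `n_boot`, and the variable names are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df_new = pd.DataFrame({
    'gender': ['male', 'female', 'female', 'male', 'female',
               'female', 'male', 'female', 'male'],
    'values': [10, 2, 13, 4, 11, 8, 14, 19, 2],
})

values = df_new['values'].to_numpy(dtype=float)
is_male = (df_new['gender'] == 'male').to_numpy()
n = len(df_new)

# Observed male share of the total, in percent
male_share = values[is_male].sum() * 100 / values.sum()

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
n_boot = 5000                    # number of bootstrap resamples (arbitrary)
shares = np.empty(n_boot)
for i in range(n_boot):
    idx = rng.integers(0, n, size=n)          # resample row indices with replacement
    v, m = values[idx], is_male[idx]
    shares[i] = v[m].sum() * 100 / v.sum()    # male share in this resample

# Percentile bootstrap 95% confidence interval for the male share
ci_low, ci_high = np.percentile(shares, [2.5, 97.5])

print('male share:', male_share)
print('95% bootstrap CI:', (ci_low, ci_high))
```

The female share is 100 minus the male share, so its interval follows directly. With nine rows the bootstrap is only a rough guide.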

Upvotes: 0
