Reputation: 327
I have a dataframe that looks like:
df:

colA  colB
   A  0.97
   A  0.67
   A  0.32
   B  0.98
   B  0.81
When I run:

from scipy import stats

t, p = stats.ttest_ind(group["colA"], group["colB"])
It throws an error:
TypeError: unsupported operand type(s) for /: 'str' and 'int'
Upvotes: 0
Views: 3194
Reputation: 13999
The description of scipy.stats.ttest_ind from the docs:

Calculate the T-test for the means of two independent samples of scores.
The problem you're running into is that while the values in 'colB' are indeed a valid example of possible "scores", the values in 'colA' are not: they're just letters. There's no way to do a t-test between a group of numbers and a group of letters; they simply aren't comparable like that. Internally, ttest_ind tries to compute the mean of each sample, and taking the mean of 'colA' amounts to dividing a sum of strings by an integer count, which is what raises the error.
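You can reproduce the failure without scipy at all. Here's a minimal sketch (the exact internals of ttest_ind may differ, but the mechanism is the same): pandas hands the letters over as an object array, "summing" that array concatenates the strings, and dividing the resulting string by the sample size raises exactly the TypeError you're seeing.

import numpy as np

# 'colA' comes through as an object array of strings
letters = np.array(['A', 'A', 'A', 'B', 'B'], dtype=object)

letters.sum()     # 'AAABB' -- summing strings concatenates them
np.mean(letters)  # TypeError: unsupported operand type(s) for /: 'str' and 'int'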
If your values in the first column are meant to represent success and failure, then you're in a situation where one of your groups is binary valued while the other is continuously valued. In these kinds of cases, a more appropriate approach is to perform a logistic regression. You would then get your p value using the Wald test. If the values in the first column represent a categorical variable, you would do a multinomial logistic regression instead.
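In case the multinomial route is the one you need, statsmodels also provides MNLogit. Here's a rough sketch on made-up data (the df3, 'cat' and 'score' names are purely for illustration), just to show the shape of the call:

import pandas as pd
import statsmodels.api as sm

# made-up example: a categorical outcome with three levels and one numeric score
df3 = pd.DataFrame({'cat':   ['A', 'B', 'C', 'A', 'B', 'C', 'B', 'A', 'C'],
                    'score': [0.9, 0.8, 0.3, 0.4, 0.6, 0.7, 0.2, 0.5, 0.1]})
df3['intercept'] = 1.0

# encode the outcome as integers 0..K-1 to keep things simple
endog = pd.Categorical(df3['cat']).codes
result = sm.MNLogit(endog, df3[['score', 'intercept']]).fit()
print(result.summary())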
Back to the binary case in your question: first you'd convert the first column of your dataframe to ones and zeros. Given that A is success and B is failure, here's how you perform the conversion:
df['colA'] = df['colA'].replace({'A':1, 'B':0})
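For reference, here's that setup end to end on the sample data from the question, so you can check the conversion (and so the example further down is reproducible):

import pandas as pd

# the example data from the question
df = pd.DataFrame({'colA': ['A', 'A', 'A', 'B', 'B'],
                   'colB': [0.97, 0.67, 0.32, 0.98, 0.81]})
df['colA'] = df['colA'].replace({'A': 1, 'B': 0})
print(df['colA'].tolist())   # [1, 1, 1, 0, 0]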
You'll have to install the statsmodels package for this next part (if you have pip, just run pip install statsmodels), but the package makes it super easy to perform a logistic regression. Consult the statsmodels.discrete.discrete_model.Logit docs if you have any questions about how to use it.
Here's a basic example:
import statsmodels.api as sm

# add a constant column so the model fits an intercept term
df['intercept'] = 1.0

# regress the binary outcome (colA) on everything after it (colB and the intercept)
logit_model = sm.Logit(df['colA'], df[df.columns[1:]])
result = logit_model.fit()
print(result.summary())
You'll then get an output like:
Optimization terminated successfully.
         Current function value: 0.528480
         Iterations 7
                           Logit Regression Results
==============================================================================
Dep. Variable:                   colA   No. Observations:                    5
Model:                          Logit   Df Residuals:                        3
Method:                           MLE   Df Model:                            1
Date:                Tue, 18 Dec 2018   Pseudo R-squ.:                  0.2148
Time:                        11:10:54   Log-Likelihood:                -2.6424
converged:                       True   LL-Null:                       -3.3651
                                        LLR p-value:                    0.2293
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
colB          -6.1702      6.722     -0.918      0.359     -19.345       7.004
intercept      5.3452      5.791      0.923      0.356      -6.005      16.695
==============================================================================
Not a very good p-value, but there are only 5 data points, so that's what you'd expect. Assuming there's much more data in your real dataframe, you'll likely get better results.
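If you'd rather pull the Wald-test p-values out programmatically instead of reading them off the summary table, the fitted result exposes them directly:

# Wald-test p-values for each regressor (here colB and the intercept)
print(result.pvalues)

# and the fitted coefficients themselves
print(result.params)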
Upvotes: 3