TjS

Reputation: 327

TypeError: unsupported operand type(s) for /: 'str' and 'int' while performing stats.ttest_ind

I have a dataframe that looks like:

  df:
  colA      colB
  A         0.97
  A         0.67
  A         0.32
  B         0.98
  B         0.81

t, p = stats.ttest_ind(group["colA"], group["colB"])

It throws an error:

TypeError: unsupported operand type(s) for /: 'str' and 'int'

Upvotes: 0

Views: 3194

Answers (1)

tel

Reputation: 13999

The problem

The description of scipy.stats.ttest_ind from the docs:

Calculate the T-test for the means of two independent samples of scores.

The problem you're running into is that while the values in 'colB' are indeed a valid example of possible "scores", the values in 'colA' are not: they're just letters. There's no way you can do a t-test between a group of numbers and a group of letters; they're just not comparable like that. Internally, ttest_ind computes the mean of each sample. Taking the mean of 'colA' means summing its values (which concatenates the strings) and then dividing that string by the count, which is a `str / int` operation and causes the error.

The solution

If your values in the first column are meant to represent success and failure, then you're in a situation where one of your groups is binary valued while the other is continuously valued. In these kinds of cases, a more appropriate approach is to perform a logistic regression. You would then get your p value using the Wald test. If the values in the first column represent a categorical variable, you would do a multinomial logistic regression instead.

First, convert the first column of your dataframe to ones and zeros. Assuming that A is success and B is failure, here's how you perform the conversion:

df['colA'] = df['colA'].replace({'A':1, 'B':0})
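Applied to the sample frame from the question, the conversion looks like this (a quick sketch, assuming pandas):

```python
import pandas as pd

df = pd.DataFrame({'colA': ['A', 'A', 'A', 'B', 'B'],
                   'colB': [0.97, 0.67, 0.32, 0.98, 0.81]})

# success (A) becomes 1, failure (B) becomes 0
df['colA'] = df['colA'].replace({'A': 1, 'B': 0})
print(df['colA'].tolist())  # [1, 1, 1, 0, 0]
```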

You'll have to install the statsmodels package for this next part (if you have pip, just run pip install statsmodels), but the package makes it super easy to perform a logistic regression. You should consult the statsmodels.discrete.discrete_model.Logit docs if you have any questions about how to use it.

Here's a basic example:

import statsmodels.api as sm
df['intercept'] = 1.0

logit_model = sm.Logit(df['colA'], df[df.columns[1:]])
result = logit_model.fit()
print(result.summary())

You'll then get output like:

Optimization terminated successfully.
         Current function value: 0.528480
         Iterations 7
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                   colA   No. Observations:                    5
Model:                          Logit   Df Residuals:                        3
Method:                           MLE   Df Model:                            1
Date:                Tue, 18 Dec 2018   Pseudo R-squ.:                  0.2148
Time:                        11:10:54   Log-Likelihood:                -2.6424
converged:                       True   LL-Null:                       -3.3651
                                        LLR p-value:                    0.2293
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
colB          -6.1702      6.722     -0.918      0.359     -19.345       7.004
intercept      5.3452      5.791      0.923      0.356      -6.005      16.695
==============================================================================

Not a very good p value, but there are only 5 data points, so that's what you'd expect. Assuming there's much more data in your real dataframe, you'll likely get better results.

Upvotes: 3
