Ellie Hanna

Reputation: 58

Python statistics package mean() gives wrong answer from pandas dataframe

I'm using Python 3.6 and trying to get the mean of some values in a subset of a row of a pandas dataframe (pandas version 0.23.4). I'm getting the values with .loc[] and then taking their mean with mean() from Python's statistics module, like so:

import statistics as st
rows = ['row1','row2','row3']
somelist = []
for i in rows:
    a = df.loc[i,"Q1":"Q7"]
    somelist.append(st.mean(a))

I end up getting answers without any decimal places. If I manually write in the answers to items Q1:Q7 into a list, this is the result:

a = st.mean([2,3,4,4,2,6,5])
print(a)
Out: 3.7142857142857144

But if that sequence was what I pulled from the dataframe, I get a mean with no decimal places, like so:

a = st.mean(df.loc[i,"Q1":"Q7"])
Out: 3

Evidently that's because the result comes back as a numpy.int64 instead of a float. This happens even if I convert the slice from the dataframe into a list, like this:

a = st.mean(list(df.loc[i,"Q1":"Q7"]))
Out: 3

Weirdly, it does NOT happen if I use .mean() :

a = df.loc[i,"Q1":"Q7"].mean()
Out: 3.7142857142857144

I double-checked the st.stdev() method and it seems to work fine. What's going on? Why does it want to print out an integer for the mean automatically? Thanks!

Upvotes: 0

Views: 1286

Answers (2)

Warren Weckesser

Reputation: 114811

statistics.mean converts the output to the same type as the inputs. If the input values are all, say, numpy.int64, the result is converted to numpy.int64. Here's the source for statistics.mean in Python 3.6.7:

def mean(data):
    """Return the sample arithmetic mean of data.

    >>> mean([1, 2, 3, 4, 4])
    2.8

    >>> from fractions import Fraction as F
    >>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
    Fraction(13, 21)

    >>> from decimal import Decimal as D
    >>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
    Decimal('0.5625')

    If ``data`` is empty, StatisticsError will be raised.
    """
    if iter(data) is data:
        data = list(data)
    n = len(data)
    if n < 1:
        raise StatisticsError('mean requires at least one data point')
    T, total, count = _sum(data)
    assert count == n
    return _convert(total/n, T)

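Note that total/n is converted to the input type T before being returned. In this version, _convert switches T to float only when T is a subclass of the built-in int and the result is not a whole number. numpy.int64 is not a subclass of int in Python 3, so the mean is passed to numpy.int64(...) and truncated to 3, whereas a list of plain Python ints gives a float.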

To avoid this, you could convert the input to floating point before passing it to statistics.mean.
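A minimal sketch of that workaround, using only the standard library. The helper name float_mean is made up for illustration and is not part of statistics or pandas. It casts each value to a built-in float first, so the type that statistics.mean converts back to is float. With a pandas row you could presumably get the same effect by passing df.loc[i, "Q1":"Q7"].astype(float) instead.

```python
import statistics as st


def float_mean(values):
    """Return the mean of `values` after casting each one to a built-in float.

    `float_mean` is a hypothetical helper name used only for this example.
    """
    # statistics.mean converts its result back to the input type, so
    # feeding it floats guarantees a float result with no truncation.
    return st.mean([float(v) for v in values])


# The values typed in by hand in the question; a row pulled from the
# dataframe (numpy.int64 values) can be passed in exactly the same way.
print(float_mean([2, 3, 4, 4, 2, 6, 5]))
```

Calling the Series' own .mean(), as in the question, sidesteps the issue as well, since it does not go through statistics.mean at all.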

Upvotes: 1

Marko Maksimovic

Reputation: 87

I think something is going wrong inside the for loop. Try printing a for each row you are going through, along with the mean that gets appended to the list.

Upvotes: 0
