Reputation: 21

Calculate mean for selected rows for selected columns in pandas data frame, but end up with some weird number

I am trying to find the mean of certain columns of a data frame in Python, but I end up with some really weird numbers. Can someone explain this to me? I want the mean for columns a, b and c.

import numpy as np
import pandas as pd

k = pd.DataFrame(np.array([[1, 0, 3,'kk'], [4, 5, 6,'kk'], [7, 20, 9,'k'],[3, 2, 9,'k']]),
                   columns=['a', 'b', 'c','type'])
k

which returns

    a   b   c   type
0   1   0   3   kk
1   4   5   6   kk
2   7   20  9   k
3   3   2   9   k

I want the mean for each column except the column 'type'

 k[['a','b','c']].mean()

and this gives me

a     368.25
b    1300.50
c     924.75
dtype: float64

I am so confused! Can someone explain this to me?

Upvotes: 2

Answers (3)

Saravanakumar V

Reputation: 166

This is the problem with creating a NumPy array from mixed data types. Because each sub-list contains both integers and strings, NumPy converts every element to a string, and those strings are what get passed into the DataFrame.

So the DataFrame's columns end up with the object dtype, holding strings rather than integers.

See the snippet below:

k = pd.DataFrame(np.array([[1, 0, 3,'kk'], [4, 5, 6,'kk'], [7, 20, 9,'k'],[3, 2, 9,'k']]),
                   columns=['a', 'b', 'c','type'])

print(k.dtypes)

a       object
b       object
c       object
type    object
dtype: object
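To see where the strings come from, here is a minimal sketch that inspects the intermediate NumPy array on its own, before pandas is involved. It only assumes NumPy's usual behaviour of promoting mixed integer and string input to a single string type.

```python
import numpy as np

# Same mixed list of integers and strings as in the question.
arr = np.array([[1, 0, 3, 'kk'], [4, 5, 6, 'kk'], [7, 20, 9, 'k'], [3, 2, 9, 'k']])

# A dtype kind of 'U' means every element is stored as a Unicode string.
print(arr.dtype.kind)

# The integer 1 is now the string '1'.
print(arr[0, 0] == '1')
```

The numbers are already strings at this point, so the DataFrame built from this array never sees integers.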

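You may wonder how a mean can be calculated over string objects at all. Adding strings concatenates them, so the sum of a column is one long string of digits, which is then converted to a number and divided by the number of rows. (This is what happens in the pandas version used here; newer releases may raise a TypeError instead.)

For example, take column a.

When you apply mean, it effectively performs the operation below: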

np.sum(array) / len(array)

print(np.sum(k["a"]))

'1473'

print(len(k["a"]))

4

print(np.mean(k["a"]))

368.25

Now, 368.25 is nothing but 1473 / 4.

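For column b, the strings '0', '5', '20' and '2' concatenate to '05202', so the result is 05202 / 4 = 1300.5. Likewise, column c gives 3699 / 4 = 924.75.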
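The arithmetic can be checked without pandas at all. This is a sketch in plain Python that mimics the concatenate-then-divide behaviour described above; the dictionary `columns` and the name `means` are just illustrative, not anything pandas uses internally.

```python
# Column values as the strings they became after the NumPy conversion.
columns = {
    'a': ['1', '4', '7', '3'],
    'b': ['0', '5', '20', '2'],
    'c': ['3', '6', '9', '9'],
}

means = {}
for name, values in columns.items():
    joined = ''.join(values)                   # string "sum" is concatenation
    means[name] = float(joined) / len(values)  # then divide by the row count
    print(name, joined, means[name])
```

The three results match the values from the question: 368.25, 1300.5 and 924.75.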

So, when you create a DataFrame, build it from a list of lists or from a dictionary, so that pandas assigns each column a data type according to its elements.

k = pd.DataFrame([[1, 0, 3,'kk'], [4, 5, 6,'kk'], [7, 20, 9,'k'],[3, 2, 9,'k']],
                   columns=['a', 'b', 'c','type'])

print(k.dtypes)

a        int64
b        int64
c        int64
type    object
dtype: object


print(k.mean(numeric_only=True))

a    3.75
b    6.75
c    6.75
dtype: float64
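The dictionary form mentioned above could look like the following sketch. The data is the same as in the question, only arranged by column; `col_means` is just an illustrative name.

```python
import pandas as pd

# One key per column, so pandas infers a dtype for each column separately.
k = pd.DataFrame({
    'a': [1, 4, 7, 3],
    'b': [0, 5, 20, 2],
    'c': [3, 6, 9, 9],
    'type': ['kk', 'kk', 'k', 'k'],
})

print(k.dtypes)

# Select the numeric columns explicitly before taking the mean.
col_means = k[['a', 'b', 'c']].mean()
print(col_means)
```

Selecting the columns explicitly avoids depending on how a given pandas version treats the non-numeric 'type' column.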

Upvotes: 3

Hunted

Reputation: 88

If we look at the type of variables stored in the dataframe, we find that they're stored as objects.

print(k.dtypes)

a       object
b       object
c       object
type    object
dtype: object

This means that since you've stored a string in one of the columns, the entire dataframe is being stored as objects. Each character has a number associated with it, and I believe you're getting a mean of some of those numbers (although I haven't been able to figure out how you got those numbers).

For example, if we look at the numerical value assigned to the string '0':

ord('0')

We see it has the numerical value of 48.

In order to get the mean you're looking for, you'll need to change the type.

Try:

b = k[['a', 'b', 'c']].astype(int)
print(b.mean())

a    3.75
b    6.75
c    6.75
dtype: float64
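If some cells might not be clean integers, `astype(int)` will raise an error. As an alternative (not part of the answer above, just a sketch), `pd.to_numeric` with `errors='coerce'` turns unparseable values into NaN, which `mean()` then skips. The name `numeric` is illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the question: every value ends up as a string.
k = pd.DataFrame(np.array([[1, 0, 3, 'kk'], [4, 5, 6, 'kk'], [7, 20, 9, 'k'], [3, 2, 9, 'k']]),
                 columns=['a', 'b', 'c', 'type'])

# Convert each selected column to a numeric dtype; bad values become NaN.
numeric = k[['a', 'b', 'c']].apply(pd.to_numeric, errors='coerce')

print(numeric.mean())
```

With the question's data every value parses cleanly, so the result matches the `astype(int)` version.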

edit : changed "strings" to "objects"

Upvotes: 2

ombk

Reputation: 2111

The problem with your data is that you are mixing numbers with non-numbers, namely the 'k' and 'kk' values in the type column.

Therefore your dataframe has the object dtype rather than integers.

I can't really explain at a low level how those numbers produce such an answer; however, the solution is:

TL;DR:

k[['a','b','c']].astype(int).mean()

Output:

a    3.75
b    6.75
c    6.75
dtype: float64

And Welcome!

Upvotes: 1
