StackUser
StackUser

Reputation: 142

Why is call to sum() on a data frame generating wrong numbers?

I want to sum the numerical values in each row (Store A to Store D) for the month of June and place them in an appended column 'Sum'. But the results generate very huge sum values which are wrong. How to get correct sum?

This code was run using Python 3.6 :

import pandas as pd
import numpy as np

data = np.array([
        ['', 'week','storeA','storeB','storeC','storeD'],
        [0,"2014-05-04",2643,8257,3893,6231],
        [1,"2014-05-11",6444,5736,5634,7092],
        [2,"2014-05-18",9646,2552,4253,5447],
        [3,"2014-05-25",5960,10740,8264,6063],
        [4,"2014-06-04",5960,10740,8264,6063],
        [5,"2014-06-12",7412,7374,3208,3985]
       ])
df= pd.DataFrame(data=data[1:,1:],
                  index=data[1:,0],
                  columns=data[0,1:])
print(df)

# get rows of table which match Year,Month for last month
df2 = df[df['week'].str.contains("2014-06")].copy()
print(df2)

# generate col summing up each row
col_list = list(df2)
print(col_list)
col_list.remove('week')
print(col_list)
df2['Sum'] = df2[col_list].sum(axis=1)
print(df2)

Output of Sum column for rows 4 and 5:

Row4 - 5.960107e+16

Row5 - 7.412737e+15

Upvotes: 3

Views: 1112

Answers (2)

cs95
cs95

Reputation: 403012

Because of the way your array is defined, with mixed strings and objects, everything is coerced to string. Take a look at this:

df.dtypes

week      object
storeA    object
storeB    object
storeC    object
storeD    object
dtype: object

You have columns of strings, and sum on string dataframes results in concatenation.

The solution is to convert these to integers first -

df2[col_list] = df2[col_list].astype(int) 

Your code then works.

df2[col_list].sum(axis=1)

4    31027
5    21979
dtype: int64

Alternatively, declare data as a object array -

data = np.array([[...], [...], ...], dtype=object)
df = pd.DataFrame(data=data[1:,1:], index=data[1:,0], columns=data[0,1:])

Next, perform a soft conversion using infer_objects (new in v0.22):

df = df.infer_objects()
df.dtypes

week      object
storeA     int64
storeB     int64
storeC     int64
storeD     int64
dtype: object

Works like a charm.

Upvotes: 1

Scott Boston
Scott Boston

Reputation: 153510

Use astype, to convert those strings to ints and sum works properly:

df2['Sum'] = df2[col_list].astype(int).sum(axis=1)

Output:

         week storeA storeB storeC storeD    Sum
4  2014-06-04   5960  10740   8264   6063  31027
5  2014-06-12   7412   7374   3208   3985  21979

What was happening,you were summing (concatenating) strings.

Upvotes: 3

Related Questions