Booboo
Booboo

Reputation: 44013

pandas is not summing a numeric column

I have read into a DataFrame an Excel spreadsheet with column names such as Gross, Fee, Net, etc. When I invoke the sum method on the resulting DataFrame, I saw that it was not summing the Fee column because several rows had string data in that column. So I first loop through each row testing that column to see if it contains a string and if it does, I replace it with a 0. The DataFrame sum method still does not sum the Fee column. Yet when I write out the resulting DataFrame to a new Excel spreadsheet and read it back in and apply the sum method to the resulting DataFrame, it does sum the Fee column. Can anyone explain this? Here is the code and the printed output:

import pandas as pd

pp = pd.read_excel('pp.xlsx')
# get rid of any strings in column 'Fee':
for i in range(pp.shape[0]):
    if isinstance(pp.loc[i, 'Fee'], str):
        pp.loc[i, 'Fee'] = 0
pd.to_numeric(pp['Fee']) #added this but it makes no difference
# the Fee column is still not summed:
print(pp.sum(numeric_only=True))

print('\nSecond Spreadsheet\n')

# write out Dataframe: to an Excel spreadheet:
with pd.ExcelWriter('pp2.xlsx') as writer:
    pp.to_excel(writer, sheet_name='PP')
# now read the spreadsheet back into another DataFrame:
pp2 = pd.read_excel('pp2.xlsx')
# the Fee column is summed:
print(pp2.sum(numeric_only=True))

Prints:

Gross                                                          8677.90
Net                                                            8572.43
Address Status                                                    0.00
Shipping and Handling Amount                                      0.00
Insurance Amount                                                  0.00
Sales Tax                                                         0.00
etc.

Second Spreadsheet

Unnamed: 0                                                     277885.00
Gross                                                            8677.90
Fee                                                              -105.47
Net                                                              8572.43
Address Status                                                      0.00
Shipping and Handling Amount                                        0.00
Insurance Amount                                                    0.00
Sales Tax                                                           0.00
etc.

Upvotes: 0

Views: 1114

Answers (4)

Danny
Danny

Reputation: 472

After a quick analysis, from what I can see is that you are replacing the string with an integer and the values of 'Fee' column could be a mix of both of float and integer which means the dtype of that column is an object. When you do pp.sum(numeric_only=True) , it ignores the object column because of the condition numeric_only. Convert your column to a float64 as in pp['Fee'] = pd.to_numeric(pp['Fee']) and it should work for you.

The reason that it is happening second time is because excel does the data conversion for you and when you read it, it's a numeric data type.

Upvotes: 1

Booboo
Booboo

Reputation: 44013

Everyone who has responded should get partial credit for telling me about pd.to_numeric. But they were all missing one piece. It is not sufficient to say pd.to_numeric(pp['Fee']. That returns the column converted to numeric but does not update the original DataFrame, so when I do a pp.sum(), nothing in pp was modified. You need:

pp['Fee'] = pd.to_numeric(pp['Fee'])
pp.sum()

Upvotes: 0

Paul H
Paul H

Reputation: 68116

The problem here is that the Fee column isn't numeric. So you need to convert it to a numeric field, save that updated field in the existing dataframe, and then compute the sum.

So that would be:

df = df.assign(Fee=pd.to_numeric(df['Fee'], errors='coerce'))
print(df.sum())

Upvotes: 1

Rakesh
Rakesh

Reputation: 82755

Try using pd.to_numeric

Ex:

pp = pd.read_excel('pp.xlsx')
print(pd.to_numeric(pp['Fee'], errors='coerce').dropna().sum())

Upvotes: 2

Related Questions