Reputation: 44013
I have read into a DataFrame an Excel spreadsheet with column names such as Gross, Fee, Net, etc. When I invoke the sum
method on the resulting DataFrame, I saw that it was not summing the Fee column because several rows had string data in that column. So I first loop through each row testing that column to see if it contains a string and if it does, I replace it with a 0. The DataFrame sum
method still does not sum the Fee column. Yet when I write out the resulting DataFrame to a new Excel spreadsheet and read it back in and apply the sum
method to the resulting DataFrame, it does sum the Fee column. Can anyone explain this? Here is the code and the printed output:
import pandas as pd
pp = pd.read_excel('pp.xlsx')
# get rid of any strings in column 'Fee':
for i in range(pp.shape[0]):
if isinstance(pp.loc[i, 'Fee'], str):
pp.loc[i, 'Fee'] = 0
pd.to_numeric(pp['Fee']) #added this but it makes no difference
# the Fee column is still not summed:
print(pp.sum(numeric_only=True))
print('\nSecond Spreadsheet\n')
# write out Dataframe: to an Excel spreadheet:
with pd.ExcelWriter('pp2.xlsx') as writer:
pp.to_excel(writer, sheet_name='PP')
# now read the spreadsheet back into another DataFrame:
pp2 = pd.read_excel('pp2.xlsx')
# the Fee column is summed:
print(pp2.sum(numeric_only=True))
Prints:
Gross 8677.90
Net 8572.43
Address Status 0.00
Shipping and Handling Amount 0.00
Insurance Amount 0.00
Sales Tax 0.00
etc.
Second Spreadsheet
Unnamed: 0 277885.00
Gross 8677.90
Fee -105.47
Net 8572.43
Address Status 0.00
Shipping and Handling Amount 0.00
Insurance Amount 0.00
Sales Tax 0.00
etc.
Upvotes: 0
Views: 1114
Reputation: 472
After a quick analysis, from what I can see is that you are replacing the string with an integer and the values of 'Fee'
column could be a mix of both of float and integer which means the dtype
of that column is an object
. When you do pp.sum(numeric_only=True)
, it ignores the object column because of the condition numeric_only
. Convert your column to a float64
as in pp['Fee'] = pd.to_numeric(pp['Fee'])
and it should work for you.
The reason that it is happening second time is because excel does the data conversion for you and when you read it, it's a numeric
data type.
Upvotes: 1
Reputation: 44013
Everyone who has responded should get partial credit for telling me about pd.to_numeric. But they were all missing one piece. It is not sufficient to say pd.to_numeric(pp['Fee']
. That returns the column converted to numeric but does not update the original DataFrame, so when I do a pp.sum()
, nothing in pp
was modified. You need:
pp['Fee'] = pd.to_numeric(pp['Fee'])
pp.sum()
Upvotes: 0
Reputation: 68116
The problem here is that the Fee column isn't numeric. So you need to convert it to a numeric field, save that updated field in the existing dataframe, and then compute the sum.
So that would be:
df = df.assign(Fee=pd.to_numeric(df['Fee'], errors='coerce'))
print(df.sum())
Upvotes: 1
Reputation: 82755
Try using pd.to_numeric
Ex:
pp = pd.read_excel('pp.xlsx')
print(pd.to_numeric(pp['Fee'], errors='coerce').dropna().sum())
Upvotes: 2