How to resolve pandas summing rows like strings, instead of numbers?

Question

I tried to get sum over the columns by row-wise for my time series data but the sum is weird. I reset_index for date and just take sum for all columns by rows. Can anyone point me out what's going on with this? Any quick thought? Thanks

my attempt:

Here is the data that I used and my attempt here:

import pandas as pd

df = pd.read_csv("https://gist.github.com/jerry-shad/ce26357dcabea22f8db307e5d8a625ff#file-ads_df-csv")

df_grp = df.groupby(['date', 'retail_item'])['number_of_stores'].sum().unstack().reset_index('date')
df_grp.set_index('date', inplace=True)
df_grp.loc[:,'Total'] = df_grp.sum(axis=1)

but Total column is weried, it should get correct sum by using above attempt, but output is wrong. I think something wrong. Can anyone one point me out what is going on here?

here is the current output:

I also tried like this:

df_grp = df.groupby(['date', 'retail_item']).agg({'number_of_stores': 'sum'})
df_grpe_pcts = df_grp.groupby(level=0).apply(lambda x:100 * x / float(x.sum()))
df_grp = df_grp.diff() / df_grp.shift()

main motivation is first group by data by retail_item, then get sum of number_of_stores for all retail_items for each week, then I want to get percentage and percentage change respect to total sum. How can I make this right? Any quick idea to make this work in pandas? thanks

Data Sample

Unnamed: 0,date,region,grade,cut,retail_item,number_of_stores,weighted_avg
40,2016-01-01,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 90% OR MORE,"1,980",4.53
41,2016-01-01,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 80-89%,"4,020",3.65
42,2016-01-01,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 70-79%,940,2.1
88,2016-01-08,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 90% OR MORE,"3,770",4.76
89,2016-01-08,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 80-89%,"7,770",3.88
90,2016-01-08,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 70-79%,"2,000",2.52
134,2016-01-15,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 90% OR MORE,"6,600",4.69
135,2016-01-15,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 80-89%,"5,640",3.89
136,2016-01-15,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 70-79%,"3,000",2.34
181,2016-01-22,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 90% OR MORE,"1,920",4.79
182,2016-01-22,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 80-89%,"8,830",3.43
183,2016-01-22,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 70-79%,"3,060",2.28
228,2016-01-29,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 90% OR MORE,"2,640",4.2
229,2016-01-29,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 80-89%,"4,420",3.71
230,2016-01-29,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 70-79%,"3,060",2.42
277,2016-02-05,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 90% OR MORE,"4,240",4.87
278,2016-02-05,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 80-89%,"9,820",3.65
279,2016-02-05,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 70-79%,"1,620",2.76
325,2016-02-12,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 90% OR MORE,"4,550",4.88
326,2016-02-12,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 80-89%,"3,540",4.11
327,2016-02-12,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 70-79%,"1,450",2.77
371,2016-02-19,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 90% OR MORE,"3,110",4.84
372,2016-02-19,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 80-89%,"6,270",3.78
373,2016-02-19,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 70-79%,"3,250",2.41
419,2016-02-26,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 90% OR MORE,"3,040",5.04
420,2016-02-26,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 80-89%,"6,420",3.74
421,2016-02-26,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 70-79%,"2,100",2.64
467,2016-03-04,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 90% OR MORE,"3,440",4.74
468,2016-03-04,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 80-89%,"6,040",3.58
469,2016-03-04,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 70-79%,"2,350",2.55

Trenton McKinney · Accepted Answer

As you can see in the image Total is summing strings together.
The columns are not correctly parsed as float type, because of the ,.
The correct way to parse the data, is by using the thousands parameter when reading the data with pandas.read_csv.

import pandas as pd

url = 'https://gist.githubusercontent.com/jerry-shad/ce26357dcabea22f8db307e5d8a625ff/raw/1fee3176f5364d0d08b8f97bae781e16c47cea3d/ads_df.csv'

# specify the thousand parameter when reading the data in
df = pd.read_csv(url, parse_dates=['date'], thousands=',')

# drop the unneeded column
df.drop(columns=['Unnamed: 0'], inplace=True)

# groupby
dfg = df.groupby(['date', 'retail_item'])['number_of_stores'].sum().unstack()

# sum rows
dfg['Total'] = dfg.sum(axis=1)

# display(dfg.head())
retail_item  GROUND BEEF 70-79%  GROUND BEEF 80-89%  GROUND BEEF 90% OR MORE  Total
date                                                                               
2016-01-01                  940                4020                     1980   6940
2016-01-08                 2000                7770                     3770  13540
2016-01-15                 3000                5640                     6600  15240
2016-01-22                 3060                8830                     1920  13810
2016-01-29                 3060                4420                     2640  10120

How to resolve pandas summing rows like strings, instead of numbers?

Data Sample

Answers (2)

Related Questions