Reputation: 223
I tried to get sum over the columns by row-wise for my time series data but the sum is weird. I reset_index
for date
and just take sum for all columns by rows. Can anyone point me out what's going on with this? Any quick thought? Thanks
my attempt:
Here is the data that I used and my attempt here:
import pandas as pd
df = pd.read_csv("https://gist.github.com/jerry-shad/ce26357dcabea22f8db307e5d8a625ff#file-ads_df-csv")
df_grp = df.groupby(['date', 'retail_item'])['number_of_stores'].sum().unstack().reset_index('date')
df_grp.set_index('date', inplace=True)
df_grp.loc[:,'Total'] = df_grp.sum(axis=1)
but Total
column is weried, it should get correct sum by using above attempt, but output is wrong. I think something wrong. Can anyone one point me out what is going on here?
here is the current output:
I also tried like this:
df_grp = df.groupby(['date', 'retail_item']).agg({'number_of_stores': 'sum'})
df_grpe_pcts = df_grp.groupby(level=0).apply(lambda x:100 * x / float(x.sum()))
df_grp = df_grp.diff() / df_grp.shift()
main motivation is first group by data by retail_item
, then get sum of number_of_stores
for all retail_items
for each week, then I want to get percentage and percentage change respect to total sum. How can I make this right? Any quick idea to make this work in pandas? thanks
Unnamed: 0,date,region,grade,cut,retail_item,number_of_stores,weighted_avg
40,2016-01-01,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 90% OR MORE,"1,980",4.53
41,2016-01-01,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 80-89%,"4,020",3.65
42,2016-01-01,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 70-79%,940,2.1
88,2016-01-08,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 90% OR MORE,"3,770",4.76
89,2016-01-08,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 80-89%,"7,770",3.88
90,2016-01-08,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 70-79%,"2,000",2.52
134,2016-01-15,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 90% OR MORE,"6,600",4.69
135,2016-01-15,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 80-89%,"5,640",3.89
136,2016-01-15,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 70-79%,"3,000",2.34
181,2016-01-22,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 90% OR MORE,"1,920",4.79
182,2016-01-22,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 80-89%,"8,830",3.43
183,2016-01-22,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 70-79%,"3,060",2.28
228,2016-01-29,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 90% OR MORE,"2,640",4.2
229,2016-01-29,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 80-89%,"4,420",3.71
230,2016-01-29,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 70-79%,"3,060",2.42
277,2016-02-05,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 90% OR MORE,"4,240",4.87
278,2016-02-05,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 80-89%,"9,820",3.65
279,2016-02-05,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 70-79%,"1,620",2.76
325,2016-02-12,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 90% OR MORE,"4,550",4.88
326,2016-02-12,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 80-89%,"3,540",4.11
327,2016-02-12,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 70-79%,"1,450",2.77
371,2016-02-19,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 90% OR MORE,"3,110",4.84
372,2016-02-19,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 80-89%,"6,270",3.78
373,2016-02-19,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 70-79%,"3,250",2.41
419,2016-02-26,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 90% OR MORE,"3,040",5.04
420,2016-02-26,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 80-89%,"6,420",3.74
421,2016-02-26,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 70-79%,"2,100",2.64
467,2016-03-04,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 90% OR MORE,"3,440",4.74
468,2016-03-04,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 80-89%,"6,040",3.58
469,2016-03-04,NATIONAL,SUMMARY,GRND BEEF,GROUND BEEF 70-79%,"2,350",2.55
Upvotes: 1
Views: 214
Reputation: 62523
Total
is summing strings together.float
type, because of the ,
.thousands
parameter when reading the data with pandas.read_csv
.import pandas as pd
url = 'https://gist.githubusercontent.com/jerry-shad/ce26357dcabea22f8db307e5d8a625ff/raw/1fee3176f5364d0d08b8f97bae781e16c47cea3d/ads_df.csv'
# specify the thousand parameter when reading the data in
df = pd.read_csv(url, parse_dates=['date'], thousands=',')
# drop the unneeded column
df.drop(columns=['Unnamed: 0'], inplace=True)
# groupby
dfg = df.groupby(['date', 'retail_item'])['number_of_stores'].sum().unstack()
# sum rows
dfg['Total'] = dfg.sum(axis=1)
# display(dfg.head())
retail_item GROUND BEEF 70-79% GROUND BEEF 80-89% GROUND BEEF 90% OR MORE Total
date
2016-01-01 940 4020 1980 6940
2016-01-08 2000 7770 3770 13540
2016-01-15 3000 5640 6600 15240
2016-01-22 3060 8830 1920 13810
2016-01-29 3060 4420 2640 10120
Upvotes: 4
Reputation: 16172
You're adding strings together. You probably want the number of stores as an integer. Try this immediately after creating the df, and see if your code starts working.
df['number_of_stores'] = df['number_of_stores'].str.replace(',','').astype(int)
Upvotes: 0