Danish
Danish

Reputation: 2871

Year wise aggregation on the given condition in pandas

I have a data frame as shown below. which is a sales data of two health care product starting from December 2016 to November 2018.

 product     price      sale_date       discount
   A         50         2016-12-01      5
   A         50         2017-01-03      4
   B         200        2016-12-24      10
   A         50         2017-01-18      3
   B         200        2017-01-28      15
   A         50         2017-01-18      6
   B         200        2017-01-28      20
   A         50         2017-04-18      6
   B         200        2017-12-08      25
   A         50         2017-11-18      6
   B         200        2017-08-21      20
   B         200        2017-12-28      30
   A         50         2018-03-18      10
   B         300        2018-06-08      45
   B         300        2018-09-20      50
   A         50         2018-11-18      8
   B         300        2018-11-28      35

From the above I would like to prepare below data frame

Expected Output:

product    year     number_of_months     total_price total_discount number_of_sales
A          2016     1                    50            5            1
B          2016     1                    200           10           1
A          2017     12                   250           25           5
B          2017     12                   1000          110          5  
A          2018     11                   100           18           2
B          2018     11                   900           130          3

Note: Please note that the data starts from Dec 2016 to Nov 2018. So number of months in 2016 is 1, in 2017 we have full data so 12 months and 2018 we have 11 months.

Upvotes: 1

Views: 75

Answers (1)

jezrael
jezrael

Reputation: 862661

First aggregate sum by years and product and then create new column for counts by months by DataFrame.insert and Series.map:

df1 =(df.groupby(['product',df['sale_date'].dt.year], sort=False).sum().add_prefix('total_')
        .reset_index())

df1.insert(2,'number_of_months', df1['sale_date'].map({2016:1, 2017:12, 2018:11}))
print (df1)

  product  sale_date  number_of_months  total_price  total_discount
0       A       2016                 1           50               5
1       A       2017                12          250              25
2       B       2016                 1          200              10
3       B       2017                12         1000             110
4       A       2018                11          100              18
5       B       2018                11          900             130

If want dynamic dictionary by minumal and maximal datetimes use:

s = pd.date_range(df['sale_date'].min(), df['sale_date'].max(), freq='MS')

d = s.year.value_counts().to_dict()
print (d)
{2017: 12, 2018: 11, 2016: 1}

df1 = (df.groupby(['product',df['sale_date'].dt.year], sort=False).sum().add_prefix('total_')
        .reset_index())

df1.insert(2,'number_of_months', df1['sale_date'].map(d))
print (df1)
  product  sale_date  number_of_months  total_price  total_discount
0       A       2016                 1           50               5
1       A       2017                12          250              25
2       B       2016                 1          200              10
3       B       2017                12         1000             110
4       A       2018                11          100              18
5       B       2018                11          900             130

For ploting is used DataFrame.set_index with DataFrame.unstack:

df2 = (df1.set_index(['sale_date','product'])[['total_price','total_discount']]
         .unstack(fill_value=0))
df2.columns = df2.columns.map('_'.join)
print (df2)
          total_price_A  total_price_B  total_discount_A  total_discount_B
sale_date                                                                  
2016                  50            200                 5                10
2017                 250           1000                25               110
2018                 100            900                18               130

df2.plot()

EDIT:

df1 = (df.groupby(['product',df['sale_date'].dt.year], sort=False)
        .agg( total_price=('price','sum'),
             total_discount=('discount','sum'),
             number_of_sales=('discount','size'))
        .reset_index())

df1.insert(2,'number_of_months', df1['sale_date'].map({2016:1, 2017:12, 2018:11}))
print (df1)
  product  sale_date  number_of_months  total_price  total_discount  \
0       A       2016               NaN           50               5   
1       A       2017               NaN          250              25   
2       B       2016               NaN          200              10   
3       B       2017               NaN         1000             110   
4       A       2018               NaN          100              18   
5       B       2018               NaN          900             130   

   number_of_sales  
0                1  
1                5  
2                1  
3                5  
4                2  
5                3  

Upvotes: 1

Related Questions