How to group multiple dataframe colmuns year-wise?

Question

I have a dataset of house pricing consisting of Region names and monthly sales prices spanning over a range of years. What I want to do is aggregate all these monthly figures into yearly figures so that in the end I am left with yearly sales figures. I tried to convert the monthly columns into DataTime indexes and tried to apply pd.Grouper on them to aggregate them but I can't get them to work.

Here's my code:

data1=pd.read_csv('zillow 1.csv')
col=data1.drop(columns=['RegionID', 'SizeRank', 'RegionName', 'RegionType', 'StateName']).columns # just to get all date-time columns.
for i in col:
    data1[i]=pd.to_datetime(data1[i],format=('%Y'))
data1.drop(columns=['RegionID', 'SizeRank',  'RegionType', 'StateName']) # just dropping other columns so that I am left with Region name and monthly figures.

g=data1.groupby(pd.Grouper(col,"Y))['RegionName'].sum()

when I execute this code , it throws me an error. TypeError: __init__() got multiple values for argument 'freq'

for clarity here's how my monthly columns look :

Index(['1996-01-31', '1996-02-29', '1996-03-31', '1996-04-30', '1996-05-31',
       '1996-06-30', '1996-07-31', '1996-08-31', '1996-09-30', '1996-10-31',
       ...
       '2020-05-31', '2020-06-30', '2020-07-31', '2020-08-31', '2020-09-30',
       '2020-10-31', '2020-11-30', '2020-12-31', '2021-01-31', '2021-02-28'],
      dtype='object', length=302)

Is there any way I can accomplish this? Ultimately my goal is to plot these yearly wise figures so that I can compare sales between different regions grpahically. Here is the dataset link if anyone's interested

Joe Ferndz · Accepted Answer

Since you want to aggregate the columns by year, here's how I would do it.

Step 1: get all the column names from df that have years. Looking at the data, these columns start from iloc[5:]

Step 2: Get unique years from the set.

Step 3: sum by row using axis=1 if column name starts with year

import pandas as pd
df = pd.read_csv('zillow.csv')

years = sorted({yr[:4] for yr in df.columns[5:].values})

for yr in years:
    df[yr] = df[[col for col in df.columns[5:] if col.startswith(yr)]].sum(axis=1)

print (df)

The output of this will be:

     RegionID  SizeRank  ...       2020       2021
0      102001         0  ...  3059372.0   541886.0
1      394913         1  ...  5894078.0  1036463.0
2      753899         2  ...  8587321.0  1516642.0
3      394463         3  ...  2991040.0   523830.0
4      394514         4  ...  3142269.0   553460.0
..        ...       ...  ...        ...        ...
907    394767       929  ...   909407.0   158436.0
908    753874       930  ...  2351237.0   401301.0
909    394968       931  ...  1607013.0   264483.0
910    395188       932  ...   816467.0   137119.0
911    394743       933  ...  3729067.0   636767.0

This will give you all the columns as follows:

>>> df.columns
Index(['RegionID', 'SizeRank', 'RegionName', 'RegionType', 'StateName',
       '1996-01-31', '1996-02-29', '1996-03-31', '1996-04-30', '1996-05-31',
       ...
       '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020',
       '2021'],
      dtype='object', length=333)

With this information, you can then groupby RegionID, SizeRank, RegionName, RegionType, StateName. Or by whichever column you want.

If you want the groupby RegionName for all the years, you can just do this.

years += ['RegionName']

print (df[years].groupby(['RegionName']).sum())

The output of this will be:

                    1996       1997       1998       1999       2000  \
RegionName                                                             
Aberdeen, SD         0.0        0.0        0.0        0.0        0.0   
Aberdeen, WA         0.0        0.0        0.0        0.0        0.0   
Abilene, TX          0.0        0.0        0.0        0.0        0.0   
Ada, OK         533030.0   548855.0   586340.0   593326.0   615267.0   
Adrian, MI      957140.0  1056303.0  1163536.0  1254465.0  1356705.0   
...                  ...        ...        ...        ...        ...   
Youngstown, OH  870237.0   898081.0   930793.0   957257.0   990584.0   
Yuba City, CA        0.0        0.0        0.0        0.0  1141184.0   
Yuma, AZ             0.0        0.0   948628.0  1306891.0  1356049.0   
Zanesville, OH  761127.0   774986.0   838063.0   881872.0   922311.0   
Zapata, TX           0.0        0.0        0.0        0.0        0.0   

                     2001       2002       2003       2004       2005  \
RegionName                                                              
Aberdeen, SD          0.0        0.0        0.0        0.0  1087463.0   
Aberdeen, WA          0.0        0.0        0.0        0.0  1323941.0   
Abilene, TX           0.0        0.0        0.0        0.0  1070523.0   
Ada, OK          642837.0   662776.0   684052.0   705584.0   740637.0   
Adrian, MI      1426047.0  1463274.0  1495317.0  1582738.0  1666558.0   
...                   ...        ...        ...        ...        ...   
Youngstown, OH  1033607.0  1064146.0  1089705.0  1124423.0  1162917.0   
Yuba City, CA   1387296.0  1570888.0  1860855.0  2343419.0  2996582.0   
Yuma, AZ        1379097.0  1421725.0  1481491.0  1634469.0  2050662.0   
Zanesville, OH   981391.0  1002341.0  1037380.0  1091709.0  1137257.0   
Zapata, TX            0.0        0.0        0.0        0.0   159482.0   

                     2006       2007       2008       2009       2010  \
RegionName                                                              
Aberdeen, SD    1293335.0  1375324.0  1437710.0  1451361.0  1462089.0   
Aberdeen, WA    1545565.0  1654038.0  1619901.0  1491200.0  1373750.0   
Abilene, TX     1153954.0  1222018.0  1240798.0  1235595.0  1226032.0   
Ada, OK          773481.0   794686.0   794048.0   894670.0   934917.0   
Adrian, MI      1705913.0  1660779.0  1548574.0  1417074.0  1313150.0   
...                   ...        ...        ...        ...        ...   
Youngstown, OH  1177239.0  1164396.0  1140398.0  1073945.0  1014500.0   
Yuba City, CA   3293180.0  2975985.0  2535911.0  2173688.0  2107412.0   
Yuma, AZ        2468443.0  2467547.0  2203446.0  1867022.0  1694541.0   
Zanesville, OH  1180517.0  1224656.0  1212803.0  1167160.0  1146663.0   
Zapata, TX       666839.0   704868.0   749365.0   725521.0   717505.0   

                     2011       2012       2013       2014       2015  \
RegionName                                                              
Aberdeen, SD    1476833.0  1508806.0  1572832.0  1643632.0  1704464.0   
Aberdeen, WA    1287545.0  1275520.0  1335709.0  1415693.0  1543013.0   
Abilene, TX     1184050.0  1210285.0  1238084.0  1308606.0  1342747.0   
Ada, OK          924909.0   965829.0   981066.0   970680.0  1001622.0   
Adrian, MI      1250472.0  1237695.0  1285864.0  1352536.0  1427238.0   
...                   ...        ...        ...        ...        ...   
Youngstown, OH   966189.0   942388.0   950965.0   944779.0   960177.0   
Yuba City, CA   2016705.0  2029500.0  2279999.0  2502882.0  2565389.0   
Yuma, AZ        1441573.0  1387678.0  1455966.0  1551797.0  1595309.0   
Zanesville, OH  1116737.0  1114035.0  1128231.0  1166683.0  1209147.0   
Zapata, TX       693401.0   725566.0   759697.0   818673.0   893575.0   

                     2016       2017       2018       2019       2020  \
RegionName                                                              
Aberdeen, SD    1791782.0  1898902.0  1994661.0  2060803.0  2093338.0   
Aberdeen, WA    1703247.0  1855551.0  2073542.0  2290696.0  2623661.0   
Abilene, TX     1418761.0  1474377.0  1556264.0  1623550.0  1684742.0   
Ada, OK         1073328.0  1105749.0  1169043.0  1151188.0  1223348.0   
Adrian, MI      1525827.0  1657855.0  1750303.0  1829592.0  1925909.0   
...                   ...        ...        ...        ...        ...   
Youngstown, OH   989771.0  1050918.0  1109259.0  1154718.0  1232026.0   
Yuba City, CA   2753120.0  3080021.0  3367719.0  3551897.0  3778631.0   
Yuma, AZ        1634934.0  1739666.0  1832909.0  1903329.0  2063665.0   
Zanesville, OH  1248745.0  1307947.0  1379258.0  1470346.0  1565903.0   
Zapata, TX       948207.0  1016711.0  1049410.0  1096617.0  1117857.0   

                    2021  
RegionName                
Aberdeen, SD    354763.0  
Aberdeen, WA    495262.0  
Abilene, TX     296703.0  
Ada, OK         209543.0  
Adrian, MI      336499.0  
...                  ...  
Youngstown, OH  224846.0  
Yuba City, CA   674837.0  
Yuma, AZ        379388.0  
Zanesville, OH  260520.0  
Zapata, TX      193910.0  

[912 rows x 26 columns]

How to group multiple dataframe colmuns year-wise?

Answers (1)

Related Questions