Reputation: 645
I have a dataset of house pricing consisting of Region names and monthly sales prices spanning over a range of years. What I want to do is aggregate all these monthly figures into yearly figures so that in the end I am left with yearly sales figures. I tried to convert the monthly columns into DataTime indexes and tried to apply pd.Grouper on them to aggregate them but I can't get them to work.
Here's my code:
data1=pd.read_csv('zillow 1.csv')
col=data1.drop(columns=['RegionID', 'SizeRank', 'RegionName', 'RegionType', 'StateName']).columns # just to get all date-time columns.
for i in col:
data1[i]=pd.to_datetime(data1[i],format=('%Y'))
data1.drop(columns=['RegionID', 'SizeRank', 'RegionType', 'StateName']) # just dropping other columns so that I am left with Region name and monthly figures.
g=data1.groupby(pd.Grouper(col,"Y))['RegionName'].sum()
when I execute this code , it throws me an error.
TypeError: __init__() got multiple values for argument 'freq'
for clarity here's how my monthly columns look :
Index(['1996-01-31', '1996-02-29', '1996-03-31', '1996-04-30', '1996-05-31',
'1996-06-30', '1996-07-31', '1996-08-31', '1996-09-30', '1996-10-31',
...
'2020-05-31', '2020-06-30', '2020-07-31', '2020-08-31', '2020-09-30',
'2020-10-31', '2020-11-30', '2020-12-31', '2021-01-31', '2021-02-28'],
dtype='object', length=302)
Is there any way I can accomplish this? Ultimately my goal is to plot these yearly wise figures so that I can compare sales between different regions grpahically. Here is the dataset link if anyone's interested
Upvotes: 3
Views: 206
Reputation: 8508
Since you want to aggregate the columns by year, here's how I would do it.
Step 1: get all the column names from df that have years. Looking at the data, these columns start from iloc[5:]
Step 2: Get unique years from the set.
Step 3: sum by row using axis=1 if column name starts with year
import pandas as pd
df = pd.read_csv('zillow.csv')
years = sorted({yr[:4] for yr in df.columns[5:].values})
for yr in years:
df[yr] = df[[col for col in df.columns[5:] if col.startswith(yr)]].sum(axis=1)
print (df)
The output of this will be:
RegionID SizeRank ... 2020 2021
0 102001 0 ... 3059372.0 541886.0
1 394913 1 ... 5894078.0 1036463.0
2 753899 2 ... 8587321.0 1516642.0
3 394463 3 ... 2991040.0 523830.0
4 394514 4 ... 3142269.0 553460.0
.. ... ... ... ... ...
907 394767 929 ... 909407.0 158436.0
908 753874 930 ... 2351237.0 401301.0
909 394968 931 ... 1607013.0 264483.0
910 395188 932 ... 816467.0 137119.0
911 394743 933 ... 3729067.0 636767.0
This will give you all the columns as follows:
>>> df.columns
Index(['RegionID', 'SizeRank', 'RegionName', 'RegionType', 'StateName',
'1996-01-31', '1996-02-29', '1996-03-31', '1996-04-30', '1996-05-31',
...
'2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020',
'2021'],
dtype='object', length=333)
With this information, you can then groupby RegionID, SizeRank, RegionName, RegionType, StateName
. Or by whichever column you want.
If you want the groupby RegionName
for all the years, you can just do this.
years += ['RegionName']
print (df[years].groupby(['RegionName']).sum())
The output of this will be:
1996 1997 1998 1999 2000 \
RegionName
Aberdeen, SD 0.0 0.0 0.0 0.0 0.0
Aberdeen, WA 0.0 0.0 0.0 0.0 0.0
Abilene, TX 0.0 0.0 0.0 0.0 0.0
Ada, OK 533030.0 548855.0 586340.0 593326.0 615267.0
Adrian, MI 957140.0 1056303.0 1163536.0 1254465.0 1356705.0
... ... ... ... ... ...
Youngstown, OH 870237.0 898081.0 930793.0 957257.0 990584.0
Yuba City, CA 0.0 0.0 0.0 0.0 1141184.0
Yuma, AZ 0.0 0.0 948628.0 1306891.0 1356049.0
Zanesville, OH 761127.0 774986.0 838063.0 881872.0 922311.0
Zapata, TX 0.0 0.0 0.0 0.0 0.0
2001 2002 2003 2004 2005 \
RegionName
Aberdeen, SD 0.0 0.0 0.0 0.0 1087463.0
Aberdeen, WA 0.0 0.0 0.0 0.0 1323941.0
Abilene, TX 0.0 0.0 0.0 0.0 1070523.0
Ada, OK 642837.0 662776.0 684052.0 705584.0 740637.0
Adrian, MI 1426047.0 1463274.0 1495317.0 1582738.0 1666558.0
... ... ... ... ... ...
Youngstown, OH 1033607.0 1064146.0 1089705.0 1124423.0 1162917.0
Yuba City, CA 1387296.0 1570888.0 1860855.0 2343419.0 2996582.0
Yuma, AZ 1379097.0 1421725.0 1481491.0 1634469.0 2050662.0
Zanesville, OH 981391.0 1002341.0 1037380.0 1091709.0 1137257.0
Zapata, TX 0.0 0.0 0.0 0.0 159482.0
2006 2007 2008 2009 2010 \
RegionName
Aberdeen, SD 1293335.0 1375324.0 1437710.0 1451361.0 1462089.0
Aberdeen, WA 1545565.0 1654038.0 1619901.0 1491200.0 1373750.0
Abilene, TX 1153954.0 1222018.0 1240798.0 1235595.0 1226032.0
Ada, OK 773481.0 794686.0 794048.0 894670.0 934917.0
Adrian, MI 1705913.0 1660779.0 1548574.0 1417074.0 1313150.0
... ... ... ... ... ...
Youngstown, OH 1177239.0 1164396.0 1140398.0 1073945.0 1014500.0
Yuba City, CA 3293180.0 2975985.0 2535911.0 2173688.0 2107412.0
Yuma, AZ 2468443.0 2467547.0 2203446.0 1867022.0 1694541.0
Zanesville, OH 1180517.0 1224656.0 1212803.0 1167160.0 1146663.0
Zapata, TX 666839.0 704868.0 749365.0 725521.0 717505.0
2011 2012 2013 2014 2015 \
RegionName
Aberdeen, SD 1476833.0 1508806.0 1572832.0 1643632.0 1704464.0
Aberdeen, WA 1287545.0 1275520.0 1335709.0 1415693.0 1543013.0
Abilene, TX 1184050.0 1210285.0 1238084.0 1308606.0 1342747.0
Ada, OK 924909.0 965829.0 981066.0 970680.0 1001622.0
Adrian, MI 1250472.0 1237695.0 1285864.0 1352536.0 1427238.0
... ... ... ... ... ...
Youngstown, OH 966189.0 942388.0 950965.0 944779.0 960177.0
Yuba City, CA 2016705.0 2029500.0 2279999.0 2502882.0 2565389.0
Yuma, AZ 1441573.0 1387678.0 1455966.0 1551797.0 1595309.0
Zanesville, OH 1116737.0 1114035.0 1128231.0 1166683.0 1209147.0
Zapata, TX 693401.0 725566.0 759697.0 818673.0 893575.0
2016 2017 2018 2019 2020 \
RegionName
Aberdeen, SD 1791782.0 1898902.0 1994661.0 2060803.0 2093338.0
Aberdeen, WA 1703247.0 1855551.0 2073542.0 2290696.0 2623661.0
Abilene, TX 1418761.0 1474377.0 1556264.0 1623550.0 1684742.0
Ada, OK 1073328.0 1105749.0 1169043.0 1151188.0 1223348.0
Adrian, MI 1525827.0 1657855.0 1750303.0 1829592.0 1925909.0
... ... ... ... ... ...
Youngstown, OH 989771.0 1050918.0 1109259.0 1154718.0 1232026.0
Yuba City, CA 2753120.0 3080021.0 3367719.0 3551897.0 3778631.0
Yuma, AZ 1634934.0 1739666.0 1832909.0 1903329.0 2063665.0
Zanesville, OH 1248745.0 1307947.0 1379258.0 1470346.0 1565903.0
Zapata, TX 948207.0 1016711.0 1049410.0 1096617.0 1117857.0
2021
RegionName
Aberdeen, SD 354763.0
Aberdeen, WA 495262.0
Abilene, TX 296703.0
Ada, OK 209543.0
Adrian, MI 336499.0
... ...
Youngstown, OH 224846.0
Yuba City, CA 674837.0
Yuma, AZ 379388.0
Zanesville, OH 260520.0
Zapata, TX 193910.0
[912 rows x 26 columns]
Upvotes: 1