Reputation: 941
I have Redfin real estate data in which monthly house sale prices over the years are recorded for each region in the Chicago area. I want to calculate the yearly average house sale price change for the city first; meanwhile, I also want to get the yearly house sale price change for each region. Then I want to compare each region's yearly sale price change with the corresponding average yearly change for the city, and introduce a new column of binary values (1, 0) for each region along the years: if a region's yearly house sale price change is greater than the city's average yearly house sale price change, add 1, otherwise 0.
For instance, between 2012-Feb and 2013-Feb the yearly house sale price change in the Austin region is 5%, while the average yearly house sale price change in the Chicago area is 7%, so I would add the value 0 to the price_label column.
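To pin down the labeling rule as code (a hypothetical helper, just to state the logic; the function name and arguments are mine):

# hypothetical helper: 1 if the region outpaced the city-wide
# average yearly change, 0 otherwise
def price_label(region_yearly_change, city_avg_yearly_change):
    return int(region_yearly_change > city_avg_yearly_change)

price_label(0.05, 0.07)  # the Austin example above -> 0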
How can I do this sort of aggregation easily for time series data? I have posted my question several times and tried on my own in the meantime, but didn't get the correct output. Can anyone point me towards a correct solution? Thanks.
Example data (a snippet of the monthly time series, stored in a dictionary):
from pandas import Timestamp

dicts = {'Region': {0: 'Chicago, IL metro area',
1: 'Chicago, IL',
2: 'Chicago, IL - Albany Park',
3: 'Chicago, IL - Andersonville'},
Timestamp('2012-02-01 00:00:00'): {0: 88.4, 1: 95.1, 2: 76.8, 3: 193.4},
Timestamp('2012-03-01 00:00:00'): {0: 93.3, 1: 103.6, 2: 77.9, 3: 169.2},
Timestamp('2012-04-01 00:00:00'): {0: 97.6, 1: 120.4, 2: 80.9, 3: 157.3},
Timestamp('2012-05-01 00:00:00'): {0: 102.0, 1: 130.6, 2: 98.4, 3: 156.8},
Timestamp('2012-06-01 00:00:00'): {0: 110.7, 1: 150.8, 2: 109.8, 3: 175.4},
Timestamp('2012-07-01 00:00:00'): {0: 109.3, 1: 133.6, 2: 102.6, 3: 188.8},
Timestamp('2012-08-01 00:00:00'): {0: 106.9, 1: 140.5, 2: 89.0, 3: 194.8},
Timestamp('2012-09-01 00:00:00'): {0: 103.4, 1: 137.5, 2: 87.5, 3: 206.9},
Timestamp('2012-10-01 00:00:00'): {0: 98.5, 1: 121.4, 2: 98.7, 3: 209.2},
Timestamp('2012-11-01 00:00:00'): {0: 97.8, 1: 125.0, 2: 94.1, 3: 211.5},
Timestamp('2012-12-01 00:00:00'): {0: 97.1, 1: 120.9, 2: 93.3, 3: 183.8},
Timestamp('2013-01-01 00:00:00'): {0: 94.4, 1: 110.7, 2: 89.4, 3: 181.4},
Timestamp('2013-02-01 00:00:00'): {0: 91.1, 1: 104.8, 2: 95.1, 3: 177.2},
Timestamp('2013-03-01 00:00:00'): {0: 94.7, 1: 123.0, 2: 94.9, 3: 180.6},
Timestamp('2013-04-01 00:00:00'): {0: 100.9, 1: 126.8, 2: 101.4, 3: 203.0},
Timestamp('2013-05-01 00:00:00'): {0: 109.3, 1: 156.1, 2: 127.9, 3: 218.0},
Timestamp('2013-06-01 00:00:00'): {0: 116.8, 1: 165.2, 2: 125.0, 3: 218.0},
Timestamp('2013-07-01 00:00:00'): {0: 120.8, 1: 168.2, 2: 120.8, 3: 220.3},
Timestamp('2013-08-01 00:00:00'): {0: 119.8, 1: 164.7, 2: 113.6, 3: 208.3},
Timestamp('2013-09-01 00:00:00'): {0: 114.2, 1: 158.5, 2: 115.3, 3: 209.7},
Timestamp('2013-10-01 00:00:00'): {0: 116.0, 1: 156.9, 2: 127.9, 3: 205.4},
Timestamp('2013-11-01 00:00:00'): {0: 110.0, 1: 135.3, 2: 130.5, 3: 215.0},
Timestamp('2013-12-01 00:00:00'): {0: 112.6, 1: 146.0, 2: 126.6, 3: 212.5},
Timestamp('2014-01-01 00:00:00'): {0: 105.2, 1: 127.9, 2: 112.3, 3: 205.7},
Timestamp('2014-02-01 00:00:00'): {0: 104.2, 1: 126.9, 2: 106.7, 3: 202.9},
Timestamp('2014-03-01 00:00:00'): {0: 107.1, 1: 138.5, 2: 114.3, 3: 200.0},
Timestamp('2014-04-01 00:00:00'): {0: 114.8, 1: 155.9, 2: 119.3, 3: 210.9},
Timestamp('2014-05-01 00:00:00'): {0: 120.1, 1: 179.4, 2: 134.5, 3: 215.4},
Timestamp('2014-06-01 00:00:00'): {0: 126.4, 1: 186.8, 2: 141.5, 3: 225.5},
Timestamp('2014-07-01 00:00:00'): {0: 127.2, 1: 187.5, 2: 152.1, 3: 225.5},
Timestamp('2014-08-01 00:00:00'): {0: 128.8, 1: 186.1, 2: 156.9, 3: 222.1},
Timestamp('2014-09-01 00:00:00'): {0: 122.2, 1: 183.3, 2: 145.1, 3: 213.2},
Timestamp('2014-10-01 00:00:00'): {0: 120.8, 1: 161.6, 2: 147.7, 3: 228.8},
Timestamp('2014-11-01 00:00:00'): {0: 116.7, 1: 151.3, 2: 144.4, 3: 226.3},
Timestamp('2014-12-01 00:00:00'): {0: 117.2, 1: 154.0, 2: 145.1, 3: 238.8},
Timestamp('2015-01-01 00:00:00'): {0: 113.4, 1: 145.8, 2: 137.2, 3: 221.6},
Timestamp('2015-02-01 00:00:00'): {0: 108.7, 1: 139.8, 2: 140.7, 3: 232.0}}
My attempt:
import numpy as np
import pandas as pd

# regions as the index, monthly dates as columns
df_ = pd.DataFrame(dicts).set_index('Region')
df_.columns = pd.to_datetime(df_.columns)

def ratio(df):
    # month-over-month fractional change via log differences
    return np.exp(np.log(df).diff()) - 1

# append the month-over-month changes next to the raw prices
pd.concat([df_, df_.apply(ratio, axis=1)], axis=1, keys=['price', 'change'])
But the attempt above didn't return the expected aggregation result. What should I do to get the aggregation right? Can anyone point me in the right direction?
Update

Alternatively, I want to compare the monthly changes across the years to the yearly average changes for each region. Is there any way to make this aggregation happen? Thanks.
Desired output

I want to get a dataframe in which the yearly house price change percentage of each individual region is added as a new column; if a region's yearly house price change is greater than the average yearly house price change of the city, I add the binary value 1, and 0 otherwise.
expected_output = pd.DataFrame({
    'Year': ['2012', '2013', '2014', '2015',
             '2012', '2013', '2014', '2015',
             '2012', '2013', '2014', '2015'],
    'Area': ['Chicago, IL metro area'] * 4
            + ['Chicago, IL'] * 4
            + ['Chicago, IL - Albany Park'] * 4,
    'yearly_price_change': ['5%', '10%', '7%', '21%',
                            '15%', '12%', '2%', '21%',
                            '10%', '11%', '12%', '6%'],
    'price_label': [0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0]})
How can I get the correct aggregation that matches my expected dataframe? Any thoughts? Thanks.
Upvotes: 2
Views: 84
Reputation: 150805
Here's my take:
# prepare the data frame
df = pd.DataFrame(dicts).set_index('Region')
df.columns = pd.to_datetime(df.columns)
df = df.stack().reset_index()
df.columns = ['Region', 'date', 'price']
df.head()
# Region date price
#-- ---------------------- ------------------- -------
# 0 Chicago, IL metro area 2012-02-01 00:00:00 88.4
# 1 Chicago, IL metro area 2012-03-01 00:00:00 93.3
# 2 Chicago, IL metro area 2012-04-01 00:00:00 97.6
# 3 Chicago, IL metro area 2012-05-01 00:00:00 102
# 4 Chicago, IL metro area 2012-06-01 00:00:00 110.7
# get the month-over-month price change within each region,
# as I understand it from the question
df['price_change'] = df.groupby('Region').price.transform(lambda x: x.diff().abs() / x)
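As an aside, if you would rather have the usual signed fractional change relative to the previous month (the line above divides the absolute difference by the current price), pandas has a built-in for that; a minimal sketch, with price_change_signed being a column name of my choosing:

# signed month-over-month change, (x_t - x_{t-1}) / x_{t-1},
# computed independently within each region
df['price_change_signed'] = df.groupby('Region').price.pct_change()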
# compute the mean change for each year and region; the year comes
# first so it ends up as index level 0, the region as level 1
new_df = df.groupby([df.date.dt.year, 'Region'])[['price_change']].mean()
new_df
#                                   price_change
# date Region
# 2012 Chicago, IL                      0.082864
#      Chicago, IL - Albany Park        0.074394
#      Chicago, IL - Andersonville      0.066074
#      Chicago, IL metro area           0.035153
# 2013 Chicago, IL                      0.074208
#      Chicago, IL - Albany Park        0.055192
#      Chicago, IL - Andersonville      0.032249
#      Chicago, IL metro area           0.040750
# 2014 Chicago, IL                      0.063483
#      Chicago, IL - Albany Park        0.056466
#      Chicago, IL - Andersonville      0.030612
#      Chicago, IL metro area           0.032648
# 2015 Chicago, IL                      0.049580
#      Chicago, IL - Albany Park        0.041228
#      Chicago, IL - Andersonville      0.061222
#      Chicago, IL metro area           0.038374
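As a quick check, since the year is index level 0 here, you can pull a single year's cross-section out of the MultiIndex:

# mean price change of every region in 2012
new_df.loc[2012]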
# compute the price_label: here we compare, within each region, each
# year's change against that region's average across the years.
# groupby(level=1) will gather all the records of the same region (level 1);
# if you want the average across the regions for each year instead,
# change to groupby(level=0), i.e. gather all records of the same year.
new_df['price_label'] = new_df.groupby(level=1)['price_change'].transform(lambda x: (x > x.mean()).astype(int))
new_df
Output:
+-------+-----------------------------+----------------+---------------+
|       |                             |   price_change |   price_label |
+-------+-----------------------------+----------------+---------------+
| date  | Region                      |                |               |
+-------+-----------------------------+----------------+---------------+
| 2012  | Chicago, IL                 |       0.082864 |             1 |
|       | Chicago, IL - Albany Park   |       0.074394 |             1 |
|       | Chicago, IL - Andersonville |       0.066074 |             1 |
|       | Chicago, IL metro area      |       0.035153 |             0 |
| 2013  | Chicago, IL                 |       0.074208 |             1 |
|       | Chicago, IL - Albany Park   |       0.055192 |             0 |
|       | Chicago, IL - Andersonville |       0.032249 |             0 |
|       | Chicago, IL metro area      |       0.040750 |             1 |
| 2014  | Chicago, IL                 |       0.063483 |             0 |
|       | Chicago, IL - Albany Park   |       0.056466 |             0 |
|       | Chicago, IL - Andersonville |       0.030612 |             0 |
|       | Chicago, IL metro area      |       0.032648 |             0 |
| 2015  | Chicago, IL                 |       0.049580 |             0 |
|       | Chicago, IL - Albany Park   |       0.041228 |             0 |
|       | Chicago, IL - Andersonville |       0.061222 |             1 |
|       | Chicago, IL metro area      |       0.038374 |             1 |
+-------+-----------------------------+----------------+---------------+
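If you then want something close to the flat expected_output shape from the question, a sketch (the percent formatting and column names here are my own choices):

flat = new_df.reset_index()
flat.columns = ['Year', 'Area', 'yearly_price_change', 'price_label']
# render the change as a percent string, e.g. 0.082864 -> '8%'
flat['yearly_price_change'] = (flat['yearly_price_change'] * 100).round(0).astype(int).astype(str) + '%'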
I may misunderstand something, but that's the gist :-).
Upvotes: 4