Jerry07

Reputation: 941

Any quick way to get correct aggregation output for time series data using pandas?

I am working with Redfin real estate data in which the monthly house sale price over several years is recorded for each region in the Chicago area. First, I want to calculate the yearly average house sale price change for the city as a whole; at the same time, I want to get the yearly house sale price change for each region. Then I want to compare each region's yearly sale price change with the city's average yearly change, and introduce a new column with binary values (1, 0) for each region along the years: if a region's yearly price change is greater than the city's average yearly price change, add 1, otherwise 0.

For instance, between 2012-Feb and 2013-Feb the yearly house sale price change in the Austin region is 5%, while the average yearly house sale price change in the Chicago area is 7%, so I would add the value 0 to the price_label column.
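In code terms, the labeling rule for a single region and year would be (numbers taken from the example above):

region_change = 0.05    # Austin, 2012-Feb to 2013-Feb
city_avg_change = 0.07  # Chicago area average over the same window
price_label = int(region_change > city_avg_change)  # -> 0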

How can I do this sort of aggregation easily for time series data?

I have posted my question several times and tried on my own, but I didn't get the correct output. Can anyone point me toward the correct solution? Thanks.

Example data:

from pandas import Timestamp

dicts = {'Region': {0: 'Chicago, IL metro area',
  1: 'Chicago, IL',
  2: 'Chicago, IL - Albany Park',
  3: 'Chicago, IL - Andersonville'},
 Timestamp('2012-02-01 00:00:00'): {0: 88.4, 1: 95.1, 2: 76.8, 3: 193.4},
 Timestamp('2012-03-01 00:00:00'): {0: 93.3, 1: 103.6, 2: 77.9, 3: 169.2},
 Timestamp('2012-04-01 00:00:00'): {0: 97.6, 1: 120.4, 2: 80.9, 3: 157.3},
 Timestamp('2012-05-01 00:00:00'): {0: 102.0, 1: 130.6, 2: 98.4, 3: 156.8},
 Timestamp('2012-06-01 00:00:00'): {0: 110.7, 1: 150.8, 2: 109.8, 3: 175.4},
 Timestamp('2012-07-01 00:00:00'): {0: 109.3, 1: 133.6, 2: 102.6, 3: 188.8},
 Timestamp('2012-08-01 00:00:00'): {0: 106.9, 1: 140.5, 2: 89.0, 3: 194.8},
 Timestamp('2012-09-01 00:00:00'): {0: 103.4, 1: 137.5, 2: 87.5, 3: 206.9},
 Timestamp('2012-10-01 00:00:00'): {0: 98.5, 1: 121.4, 2: 98.7, 3: 209.2},
 Timestamp('2012-11-01 00:00:00'): {0: 97.8, 1: 125.0, 2: 94.1, 3: 211.5},
 Timestamp('2012-12-01 00:00:00'): {0: 97.1, 1: 120.9, 2: 93.3, 3: 183.8},
 Timestamp('2013-01-01 00:00:00'): {0: 94.4, 1: 110.7, 2: 89.4, 3: 181.4},
 Timestamp('2013-02-01 00:00:00'): {0: 91.1, 1: 104.8, 2: 95.1, 3: 177.2},
 Timestamp('2013-03-01 00:00:00'): {0: 94.7, 1: 123.0, 2: 94.9, 3: 180.6},
 Timestamp('2013-04-01 00:00:00'): {0: 100.9, 1: 126.8, 2: 101.4, 3: 203.0},
 Timestamp('2013-05-01 00:00:00'): {0: 109.3, 1: 156.1, 2: 127.9, 3: 218.0},
 Timestamp('2013-06-01 00:00:00'): {0: 116.8, 1: 165.2, 2: 125.0, 3: 218.0},
 Timestamp('2013-07-01 00:00:00'): {0: 120.8, 1: 168.2, 2: 120.8, 3: 220.3},
 Timestamp('2013-08-01 00:00:00'): {0: 119.8, 1: 164.7, 2: 113.6, 3: 208.3},
 Timestamp('2013-09-01 00:00:00'): {0: 114.2, 1: 158.5, 2: 115.3, 3: 209.7},
 Timestamp('2013-10-01 00:00:00'): {0: 116.0, 1: 156.9, 2: 127.9, 3: 205.4},
 Timestamp('2013-11-01 00:00:00'): {0: 110.0, 1: 135.3, 2: 130.5, 3: 215.0},
 Timestamp('2013-12-01 00:00:00'): {0: 112.6, 1: 146.0, 2: 126.6, 3: 212.5},
 Timestamp('2014-01-01 00:00:00'): {0: 105.2, 1: 127.9, 2: 112.3, 3: 205.7},
 Timestamp('2014-02-01 00:00:00'): {0: 104.2, 1: 126.9, 2: 106.7, 3: 202.9},
 Timestamp('2014-03-01 00:00:00'): {0: 107.1, 1: 138.5, 2: 114.3, 3: 200.0},
 Timestamp('2014-04-01 00:00:00'): {0: 114.8, 1: 155.9, 2: 119.3, 3: 210.9},
 Timestamp('2014-05-01 00:00:00'): {0: 120.1, 1: 179.4, 2: 134.5, 3: 215.4},
 Timestamp('2014-06-01 00:00:00'): {0: 126.4, 1: 186.8, 2: 141.5, 3: 225.5},
 Timestamp('2014-07-01 00:00:00'): {0: 127.2, 1: 187.5, 2: 152.1, 3: 225.5},
 Timestamp('2014-08-01 00:00:00'): {0: 128.8, 1: 186.1, 2: 156.9, 3: 222.1},
 Timestamp('2014-09-01 00:00:00'): {0: 122.2, 1: 183.3, 2: 145.1, 3: 213.2},
 Timestamp('2014-10-01 00:00:00'): {0: 120.8, 1: 161.6, 2: 147.7, 3: 228.8},
 Timestamp('2014-11-01 00:00:00'): {0: 116.7, 1: 151.3, 2: 144.4, 3: 226.3},
 Timestamp('2014-12-01 00:00:00'): {0: 117.2, 1: 154.0, 2: 145.1, 3: 238.8},
 Timestamp('2015-01-01 00:00:00'): {0: 113.4, 1: 145.8, 2: 137.2, 3: 221.6},
 Timestamp('2015-02-01 00:00:00'): {0: 108.7, 1: 139.8, 2: 140.7, 3: 232.0}}

The dictionary above shows a snippet of the time series data.

My attempt:

import numpy as np
import pandas as pd

# build the frame: regions as index, monthly timestamps as columns
df_ = pd.DataFrame(dicts).set_index('Region')
df_.columns = pd.to_datetime(df_.columns)

def ratio(df):
    # exp(diff of logs) - 1 is equivalent to a percentage change
    return np.exp(np.log(df).diff()) - 1

keys = ['price', 'change']
pd.concat([df_, df_.groupby('Region')[df_.columns].apply(ratio)],
          axis=1, keys=keys)

But the attempt above didn't return the expected aggregation result. I have tried many ways and still didn't get what I want. Can anyone point out how to make this right?

Update:

Alternatively, I want to compare the monthly changes across the years to the yearly average changes for each region. Any idea how to make this aggregation happen? Thanks.

Desired output:

I want to get a dataframe where the yearly house price change percentage of each individual district is added as a column, plus a price_label column with a binary value: 1 if the district's yearly price change is greater than the city's average yearly price change, 0 otherwise.

expected_output = pd.DataFrame({
    'Year': ['2012', '2013', '2014', '2015'] * 3,
    'Area': ['Chicago, IL metro area'] * 4
            + ['Chicago, IL'] * 4
            + ['Chicago, IL - Albany Park'] * 4,
    'yearly_price_change': ['5%', '10%', '7%', '21%',
                            '15%', '12%', '2%', '21%',
                            '10%', '11%', '12%', '6%'],
    'price_label': [0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0]
})


Any idea how to get a correct aggregation like my expected dataframe? Thanks.

Upvotes: 2

Views: 84

Answers (1)

Quang Hoang

Reputation: 150805

Here's my take:

# prepare the data frame
df = pd.DataFrame(dicts).set_index('Region')
df.columns = pd.to_datetime(df.columns)

df = df.stack().reset_index()
df.columns = ['Region', 'date', 'price']
df.head()

#    Region                  date                   price
#--  ----------------------  -------------------  -------
# 0  Chicago, IL metro area  2012-02-01 00:00:00     88.4
# 1  Chicago, IL metro area  2012-03-01 00:00:00     93.3
# 2  Chicago, IL metro area  2012-04-01 00:00:00     97.6
# 3  Chicago, IL metro area  2012-05-01 00:00:00    102
# 4  Chicago, IL metro area  2012-06-01 00:00:00    110.7
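# (side note, not from the original answer: the same long format can be
# built with melt instead of set_index + stack)
# df = pd.DataFrame(dicts).melt(id_vars='Region', var_name='date', value_name='price')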

# get the month-over-month price change, as I understand from the question
df['price_change'] = df.groupby('Region').price.apply(lambda x: x.diff().abs()/x)
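# (side note, not from the original answer: pct_change is the more common
# spelling, but it divides by the *previous* month's price rather than the
# current one, so the numbers would differ slightly from the output below)
# df['price_change'] = df.groupby('Region').price.pct_change().abs()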

# compute the mean price change per region per year
new_df = df.groupby(['Region', df.date.dt.year])[['price_change']].mean()
new_df

#                                    price_change
#Region                       date
#Chicago, IL                  2012       0.082864
#                             2013       0.074208
#                             2014       0.063483
#                             2015       0.049580
#Chicago, IL - Albany Park    2012       0.074394
#                             2013       0.055192
#                             2014       0.056466
#                             2015       0.041228
#Chicago, IL - Andersonville  2012       0.066074
#                             2013       0.032249
#                             2014       0.030612
#                             2015       0.061222
#Chicago, IL metro area       2012       0.035153
#                             2013       0.040750
#                             2014       0.032648
#                             2015       0.038374

# here we compute the average across the years for each region:
# groupby(level=0) gathers all the records of the same region (level 0).
# if you want the average across the regions for each year,
# change to groupby(level=1), i.e. gather all records of the same year.
new_df['price_label'] = new_df.groupby(level=0).apply(lambda x: (x > x.mean()).astype(int))
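# (side note, not from the original answer: the same labels can be
# computed without apply, using transform)
# region_mean = new_df.groupby(level=0)['price_change'].transform('mean')
# new_df['price_label'] = (new_df['price_change'] > region_mean).astype(int)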

new_df

Output:

+------------------------------+-------+---------------+-------------+
|                              |       | price_change  | price_label |
+------------------------------+-------+---------------+-------------+
| Region                       | date  |               |             |
+------------------------------+-------+---------------+-------------+
| Chicago, IL                  | 2012  | 0.082864      |           1 |
|                              | 2013  | 0.074208      |           1 |
|                              | 2014  | 0.063483      |           0 |
|                              | 2015  | 0.049580      |           0 |
| Chicago, IL - Albany Park    | 2012  | 0.074394      |           1 |
|                              | 2013  | 0.055192      |           0 |
|                              | 2014  | 0.056466      |           0 |
|                              | 2015  | 0.041228      |           0 |
| Chicago, IL - Andersonville  | 2012  | 0.066074      |           1 |
|                              | 2013  | 0.032249      |           0 |
|                              | 2014  | 0.030612      |           0 |
|                              | 2015  | 0.061222      |           1 |
| Chicago, IL metro area       | 2012  | 0.035153      |           0 |
|                              | 2013  | 0.040750      |           1 |
|                              | 2014  | 0.032648      |           0 |
|                              | 2015  | 0.038374      |           1 |
+------------------------------+-------+---------------+-------------+

I may misunderstand something, but that's the gist :-).
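If you need the flat layout from the question's expected_output, here is a minimal reshape sketch (column names taken from the question; the percentage formatting is my assumption):

# hypothetical reshape of new_df into the question's expected layout
out = new_df.reset_index()
out.columns = ['Area', 'Year', 'yearly_price_change', 'price_label']
out['yearly_price_change'] = (out['yearly_price_change'] * 100).round(1).astype(str) + '%'
out = out[['Year', 'Area', 'yearly_price_change', 'price_label']]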

Upvotes: 4
