Flib
Flib

Reputation: 175

Python correlation matrix 3d dataframe

I have in SQL Server a historical return table by date and asset Id like this:

[Date] [Asset] [1DRet]
jan   asset1   0.52
jan   asset2   0.12
jan   asset3   0.07
feb   asset1   0.41
feb   asset2   0.33
feb   asset3   0.21

...

So I need to calculate the correlation matrix for a given date range for all assets combinations: A1,A2 ; A1,A3 ; A2,A3

Im using pandas and in my SQL Select Where I'm filtering tha date range and ordering it by date.

I'm trying to do it using pandas df.corr(), numpy.corrcoef and Scipy but not able to do it for my n-variable dataframe

I see some example but it's always for a dataframe where you have an asset per column and one row per day.

This my code block where I'm doing it:

qryRet = "Select * from IndexesValue where Date > '20100901' and Date < '20150901' order by Date"

result = conn.execute(qryRet)

df = pd.DataFrame(data=list(result),columns=result.keys())

df1d = df[['Date','Id_RiskFactor','1DReturn']]

corr = df1d.set_index(['Date','Id_RiskFactor']).unstack().corr()
corr.columns = corr.columns.droplevel()
corr.index = corr.columns.tolist()
corr.index.name = 'symbol_1'
corr.columns.name = 'symbol_2'
print(corr)

conn.close()

For it I'm reciving this msg:

corr.columns = corr.columns.droplevel()
AttributeError: 'Index' object has no attribute 'droplevel'

**Print(df1d.head())**
         Date  Id_RiskFactor         1DReturn
0  2010-09-02            149            0E-12
1  2010-09-02            150  -0.004242875148
2  2010-09-02             33   0.000590000011
3  2010-09-02             28   0.000099999997
4  2010-09-02             34  -0.000010000000

**print(df.head())**
         Date  Id_RiskFactor           Value         1DReturn         5DReturn
0  2010-09-02            149  0.040096000000            0E-12            0E-12
1  2010-09-02            150  1.736700000000  -0.004242875148  -0.013014321215
2  2010-09-02             33  2.283000000000   0.000590000011   0.001260000048
3  2010-09-02             28  2.113000000000   0.000099999997   0.000469999999
4  2010-09-02             34  0.615000000000  -0.000010000000   0.000079999998

**print(corr.columns)**
Index([], dtype='object')

Upvotes: 2

Views: 2336

Answers (1)

Alexander
Alexander

Reputation: 109520

Create a sample DataFrame:

import pandas as pd
import numpy as np

df = pd.DataFrame({'daily_return': np.random.random(15), 
                   'symbol': ['A'] * 5 + ['B'] * 5 + ['C'] * 5, 
                   'date': np.tile(pd.date_range('1-1-2015', periods=5), 3)})

>>> df
    daily_return       date symbol
0       0.011467 2015-01-01      A
1       0.613518 2015-01-02      A
2       0.334343 2015-01-03      A
3       0.371809 2015-01-04      A
4       0.169016 2015-01-05      A
5       0.431729 2015-01-01      B
6       0.474905 2015-01-02      B
7       0.372366 2015-01-03      B
8       0.801619 2015-01-04      B
9       0.505487 2015-01-05      B
10      0.946504 2015-01-01      C
11      0.337204 2015-01-02      C
12      0.798704 2015-01-03      C
13      0.311597 2015-01-04      C
14      0.545215 2015-01-05      C

I'll assume you've already filtered your DataFrame for the relevant dates. You then want a pivot table where you have unique dates as your index and your symbols as separate columns, with daily returns as the values. Finally, you call corr() on the result.

corr = df.set_index(['date','symbol']).unstack().corr()
corr.columns = corr.columns.droplevel()
corr.index = corr.columns.tolist()  
corr.index.name = 'symbol_1'
corr.columns.name = 'symbol_2'
>>> corr
symbol_2         A         B         C
symbol_1                              
A         1.000000  0.188065 -0.745115
B         0.188065  1.000000 -0.688808
C        -0.745115 -0.688808  1.000000

You can select the subset of your DataFrame based on dates as follows:

start_date = pd.Timestamp('2015-1-4')
end_date = pd.Timestamp('2015-1-5')
>>> df.loc[df.date.between(start_date, end_date), :]
    daily_return       date symbol
3       0.371809 2015-01-04      A
4       0.169016 2015-01-05      A
8       0.801619 2015-01-04      B
9       0.505487 2015-01-05      B
13      0.311597 2015-01-04      C
14      0.545215 2015-01-05      C

If you want to flatten your correlation matrix:

corr.stack().reset_index()
  symbol_1 symbol_2         0
0        A        A  1.000000
1        A        B  0.188065
2        A        C -0.745115
3        B        A  0.188065
4        B        B  1.000000
5        B        C -0.688808
6        C        A -0.745115
7        C        B -0.688808
8        C        C  1.000000

Upvotes: 1

Related Questions