Reputation: 197
I have two DataFrames, FirstColumn
& SecondColumn
.
How do I create a new column, containing the correlation coeff. row by row for the two columns 5 periods back?
For example, the 5th row would be the R2 value of the two columns 5 periods back, the 6th row would be the corr coeff. value of the columns ranging from row 1-6 etc etc.
Additionally, what method is the most efficient when looping through a DataFrame, having to access previous rows?
FirstColumn SecondColumn
0 2 1.0
1 3 3.0
2 4 4.0
3 5 5.0
4 6 2.0
5 7 6.0
6 2 2.0
7 3 3.0
8 5 9.0
9 3 2.0
10 2 3.0
11 4 2.0
12 2 2.0
13 4 2.0
14 2 4.0
15 5 3.0
16 3 1.0
Upvotes: 0
Views: 69
Reputation: 13397
You can do:
df["corr"]=df.rolling(5, min_periods=1).corr()["FirstColumn"].loc[(slice(None), "SecondColumn")]
Outputs:
FirstColumn SecondColumn corr
0 2.0 1.0 NaN
1 3.0 3.0 1.000000
2 4.0 4.0 0.981981
3 5.0 5.0 0.982708
4 6.0 2.0 0.400000
5 7.0 6.0 0.400000
6 2.0 2.0 0.566707
7 3.0 3.0 0.610572
8 5.0 9.0 0.426961
9 3.0 2.0 0.737804
10 2.0 3.0 0.899659
11 4.0 2.0 0.698774
12 2.0 2.0 0.716769
13 4.0 2.0 -0.559017
14 2.0 4.0 -0.612372
15 5.0 3.0 -0.250000
16 3.0 1.0 -0.067267
Upvotes: 2
Reputation: 121
You can use the shift(n)
method to access the element n
rows back. One approach would be to create "lag" columns, like so:
for i in range(5):
df['FirstCol_lag'+str(i)] = df.FirstColumn.shift(i)
Then you can do your formula operations on a row-by-row basis, e.g.
df['R2'] = foo([df.FirstCol_lag1, ... df.SecondCol_lag5])
The most efficient approach would be to not use a loop and do it this way. But if the data is very large this may not be feasible. I think the iterrows()
function is pretty efficient too, you can test which is faster if you really care. For that you'd have to offset the row index manually and it would take more code.
Still you'll have to be careful about handling nan
s because the shift will be null for the first n
columns of your dataframe.
Upvotes: 1