Vichtor
Vichtor

Reputation: 197

Looping through a pandas DataFrame accessing previous elements

I have two DataFrames, FirstColumn & SecondColumn.

How do I create a new column, containing the correlation coeff. row by row for the two columns 5 periods back?

For example, the 5th row would be the R2 value of the two columns 5 periods back, the 6th row would be the corr coeff. value of the columns ranging from row 1-6 etc etc.

Additionally, what method is the most efficient when looping through a DataFrame, having to access previous rows?

FirstColumn SecondColumn
0   2   1.0
1   3   3.0
2   4   4.0
3   5   5.0
4   6   2.0
5   7   6.0
6   2   2.0
7   3   3.0
8   5   9.0
9   3   2.0
10  2   3.0
11  4   2.0
12  2   2.0
13  4   2.0
14  2   4.0
15  5   3.0
16  3   1.0

Upvotes: 0

Views: 69

Answers (2)

Georgina Skibinski
Georgina Skibinski

Reputation: 13397

You can do:

df["corr"]=df.rolling(5, min_periods=1).corr()["FirstColumn"].loc[(slice(None), "SecondColumn")]

Outputs:

    FirstColumn  SecondColumn      corr
0           2.0           1.0       NaN
1           3.0           3.0  1.000000
2           4.0           4.0  0.981981
3           5.0           5.0  0.982708
4           6.0           2.0  0.400000
5           7.0           6.0  0.400000
6           2.0           2.0  0.566707
7           3.0           3.0  0.610572
8           5.0           9.0  0.426961
9           3.0           2.0  0.737804
10          2.0           3.0  0.899659
11          4.0           2.0  0.698774
12          2.0           2.0  0.716769
13          4.0           2.0 -0.559017
14          2.0           4.0 -0.612372
15          5.0           3.0 -0.250000
16          3.0           1.0 -0.067267

Upvotes: 2

Pete
Pete

Reputation: 121

You can use the shift(n) method to access the element n rows back. One approach would be to create "lag" columns, like so:

for i in range(5):
    df['FirstCol_lag'+str(i)] = df.FirstColumn.shift(i)

Then you can do your formula operations on a row-by-row basis, e.g.

df['R2'] = foo([df.FirstCol_lag1, ... df.SecondCol_lag5])

The most efficient approach would be to not use a loop and do it this way. But if the data is very large this may not be feasible. I think the iterrows() function is pretty efficient too, you can test which is faster if you really care. For that you'd have to offset the row index manually and it would take more code.

Still you'll have to be careful about handling nans because the shift will be null for the first n columns of your dataframe.

Upvotes: 1

Related Questions