Reputation: 2735
I am trying to find the difference in quarterly balance sheet equity values from the following data frame:
import pandas as pd
import numpy as np
df2= pd.DataFrame({'FirmID' : pd.Series(['ID001', 'ID001', 'ID001', 'ID001', 'ID001', 'ID001', 'ID001', 'ID001', 'ID001', 'ID001' ]),
'RSSD9999' : pd.Series([20060331, 20060630, 20060930, 20061231, 20070331,20070630, 20070930, 20080630, 20080930, 20081231]),
'year' : pd.Series([2006, 2006, 2006, 2006, 2007, 2007, 2007, 2008, 2008, 2008 ]),
'Q' : pd.Series([1, 2, 3, 4, 1, 2, 3, 2, 3, 4 ]),
'EquityEoQ' : pd.Series([112, 223, 333, 445, 126, 251, 376, 291, 291, 503 ]),
'NewEqRight': pd.Series([112, 111, 110, 112, 126, 125, 125, np.nan, 0 , 212, ])})
df2=df2[['FirmID','RSSD9999', 'year', 'Q', 'EquityEoQ','NewEqRight']]
The frame shows equity values at the end of quarters per year: EquityEoQ
. Note, NewEqRight
shows the values as they should be, with missing values for 2007Q4 and 2008Q2.
I can find the change in equity per quarter by taking the difference between row values. For example, firm ID001 issued 111 new equity in 2006Q2 (111 = 223 - 112).
If all quarterly rows in the data are present, then I can use shift
to create a new column with last quarter's equity (EquityEoLastQ), and another column that records the difference between EquityEoQ and EquityEoLastQ to obtain the change in equity:
df2['EquityEoLastQ'] = df2.groupby(['FirmID'])['EquityEoQ'].shift(1)
df2['NewEqWrong'] = df2['EquityEoQ']-df2['EquityEoLastQ']
df2.loc[df2['Q']==1, 'NewEqWrong'] = df2.loc[df2['Q']==1, 'EquityEoQ']
The last row corrects values for Q1.
But if quarterly rows are missing, then shift
messes up. For example, in the data-frame, the rows for 2007Q4 and 2008Q1 are missing. This leads to incorrect information, because shift
refers to the wrong quarter. In this frame this approach gives a negative NewEqWrong
value for 2008Q2 is -85.0, which is the wrong value.
Desired data set:
In [9]: df2
Out[9]:
FirmID RSSD9999 year Q EquityEoQ NewEqRight EquityEoLastQ NewEqWrong
0 ID001 20060331 2006 1 112 112.0 NaN 112.0
1 ID001 20060630 2006 2 223 111.0 112.0 111.0
2 ID001 20060930 2006 3 333 110.0 223.0 110.0
3 ID001 20061231 2006 4 445 112.0 333.0 112.0
4 ID001 20070331 2007 1 126 126.0 445.0 126.0
5 ID001 20070630 2007 2 251 125.0 126.0 125.0
6 ID001 20070930 2007 3 376 125.0 251.0 125.0
7 ID001 20080630 2008 2 291 NaN 376.0 -85.0
8 ID001 20080930 2008 3 291 0.0 291.0 0.0
9 ID001 20081231 2008 4 503 212.0 291.0 212.0
In Stata one can set the time series frequency to quarterly, and then use L. or D. to find lags and differences respectively.
Is there any way pandas can deal with this problem, more or less like Stata?
Upvotes: 1
Views: 1189
Reputation: 2735
Basically, I realised that I want to merge data. It then works as desired:
See this link: lag values and differences and missing quarterly data
Upvotes: 0
Reputation: 210852
IIUC you can do it this way:
In [48]: df2
Out[48]:
EquityEoQ FirmID Q RSSD9999 year
0 112 ID001 1 20060331 2006
1 223 ID001 2 20060630 2006
2 333 ID001 3 20060930 2006
3 445 ID001 4 20061231 2006
4 126 ID001 1 20070331 2007
5 251 ID001 2 20070630 2007
6 376 ID001 3 20070930 2007
7 291 ID001 2 20080630 2008
8 291 ID001 3 20080930 2008
9 503 ID001 4 20081231 2008
In [49]: df2['NewEquity'] = \
df2.sort_values(['year','Q']).groupby(['FirmID','year'])['EquityEoQ'].diff()
In [50]: df2
Out[50]:
EquityEoQ FirmID Q RSSD9999 year NewEquity
0 112 ID001 1 20060331 2006 NaN
1 223 ID001 2 20060630 2006 111.0
2 333 ID001 3 20060930 2006 110.0
3 445 ID001 4 20061231 2006 112.0
4 126 ID001 1 20070331 2007 NaN
5 251 ID001 2 20070630 2007 125.0
6 376 ID001 3 20070930 2007 125.0
7 291 ID001 2 20080630 2008 NaN
8 291 ID001 3 20080930 2008 0.0
9 503 ID001 4 20081231 2008 212.0
Upvotes: 1