Martien Lubberink
Martien Lubberink

Reputation: 2735

How to determine the difference in quarterly row values of a pandas dataframe when quarters are missing

I am trying to find the difference in quarterly balance sheet equity values from the following data frame:

import pandas as pd
import numpy as np

df2= pd.DataFrame({'FirmID'    : pd.Series(['ID001',  'ID001',  'ID001',  'ID001',  'ID001', 'ID001',  'ID001',   'ID001',  'ID001',  'ID001' ]), 
                   'RSSD9999'  : pd.Series([20060331, 20060630, 20060930, 20061231, 20070331,20070630, 20070930,  20080630, 20080930, 20081231]), 
                   'year'      : pd.Series([2006,     2006,     2006,     2006,     2007,    2007,     2007,      2008,     2008,     2008    ]),
                   'Q'         : pd.Series([1,        2,        3,        4,        1,       2,        3,         2,        3,        4       ]), 
                   'EquityEoQ' : pd.Series([112,      223,      333,      445,      126,     251,      376,       291,      291,      503     ]),
                   'NewEqRight': pd.Series([112,      111,      110,      112,      126,      125,      125,      np.nan,      0  ,      212,    ])})
df2=df2[['FirmID','RSSD9999', 'year', 'Q', 'EquityEoQ','NewEqRight']]

The frame shows equity values at the end of quarters per year: EquityEoQ. Note, NewEqRight shows the values as they should be, with missing values for 2007Q4 and 2008Q2.

I can find the change in equity per quarter by taking the difference between row values. For example, firm ID001 issued 111 new equity in 2006Q2 (111 = 223 - 112).

If all quarterly rows in the data are present, then I can use shift to create a new column with last quarter's equity (EquityEoLastQ), and another column that records the difference between EquityEoQ and EquityEoLastQ to obtain the change in equity:

df2['EquityEoLastQ'] = df2.groupby(['FirmID'])['EquityEoQ'].shift(1)
df2['NewEqWrong']     = df2['EquityEoQ']-df2['EquityEoLastQ']
df2.loc[df2['Q']==1, 'NewEqWrong'] = df2.loc[df2['Q']==1, 'EquityEoQ']

The last row corrects values for Q1.

But if quarterly rows are missing, then shift messes up. For example, in the data-frame, the rows for 2007Q4 and 2008Q1 are missing. This leads to incorrect information, because shift refers to the wrong quarter. In this frame this approach gives a negative NewEqWrong value for 2008Q2 is -85.0, which is the wrong value.

Desired data set:

In [9]: df2
Out[9]:
  FirmID  RSSD9999  year  Q  EquityEoQ  NewEqRight  EquityEoLastQ  NewEqWrong
0  ID001  20060331  2006  1        112       112.0            NaN       112.0
1  ID001  20060630  2006  2        223       111.0          112.0       111.0
2  ID001  20060930  2006  3        333       110.0          223.0       110.0
3  ID001  20061231  2006  4        445       112.0          333.0       112.0
4  ID001  20070331  2007  1        126       126.0          445.0       126.0
5  ID001  20070630  2007  2        251       125.0          126.0       125.0
6  ID001  20070930  2007  3        376       125.0          251.0       125.0
7  ID001  20080630  2008  2        291         NaN          376.0       -85.0
8  ID001  20080930  2008  3        291         0.0          291.0         0.0
9  ID001  20081231  2008  4        503       212.0          291.0       212.0

In Stata one can set the time series frequency to quarterly, and then use L. or D. to find lags and differences respectively.

Is there any way pandas can deal with this problem, more or less like Stata?

Upvotes: 1

Views: 1189

Answers (2)

Martien Lubberink
Martien Lubberink

Reputation: 2735

Basically, I realised that I want to merge data. It then works as desired:

See this link: lag values and differences and missing quarterly data

Upvotes: 0

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210852

IIUC you can do it this way:

In [48]: df2
Out[48]:
   EquityEoQ FirmID  Q  RSSD9999  year
0        112  ID001  1  20060331  2006
1        223  ID001  2  20060630  2006
2        333  ID001  3  20060930  2006
3        445  ID001  4  20061231  2006
4        126  ID001  1  20070331  2007
5        251  ID001  2  20070630  2007
6        376  ID001  3  20070930  2007
7        291  ID001  2  20080630  2008
8        291  ID001  3  20080930  2008
9        503  ID001  4  20081231  2008

In [49]: df2['NewEquity'] = \
             df2.sort_values(['year','Q']).groupby(['FirmID','year'])['EquityEoQ'].diff()

In [50]: df2
Out[50]:
   EquityEoQ FirmID  Q  RSSD9999  year  NewEquity
0        112  ID001  1  20060331  2006        NaN
1        223  ID001  2  20060630  2006      111.0
2        333  ID001  3  20060930  2006      110.0
3        445  ID001  4  20061231  2006      112.0
4        126  ID001  1  20070331  2007        NaN
5        251  ID001  2  20070630  2007      125.0
6        376  ID001  3  20070930  2007      125.0
7        291  ID001  2  20080630  2008        NaN
8        291  ID001  3  20080930  2008        0.0
9        503  ID001  4  20081231  2008      212.0

Upvotes: 1

Related Questions