Reputation: 201
I am trying to estimate a panel regression (see: https://bashtage.github.io/linearmodels/doc/panel/examples/examples.html).
My data is formatted like this (this is just an example snippet; the original file has 11 columns plus the timestamp and thousands of rows):
What I have
Timestamp Country Dummy Pre Post All_Countries Timestamp
1993-11-01 1 0 1 6.18 1993-11-01
1993-11-02 1 0 1 6.18 1993-11-02
1993-11-03 1 0 1 6.17 1993-11-03
1993-11-04 1 1 0 6.17 1993-11-04
1993-11-15 1 1 0 6.40 1993-11-15
1993-11-01 2 0 1 7.05 1993-11-01
1993-11-02 2 0 1 7.05 1993-11-02
1993-11-03 2 0 1 7.20 1993-11-03
1993-11-04 2 1 0 7.50 1993-11-04
1993-11-15 2 1 0 7.60 1993-11-15
1993-11-01 3 0 1 7.69 1993-11-01
1993-11-02 3 0 1 7.61 1993-11-02
1993-11-03 3 0 1 7.67 1993-11-03
1993-11-04 3 1 0 7.91 1993-11-04
1993-11-15 3 1 0 8.61 1993-11-15
How you can re-create it
import numpy as np
import pandas as pd
df = pd.DataFrame(
    {
        "Timestamp": ['1993-11-01', '1993-11-02', '1993-11-03', '1993-11-04', '1993-11-15'],
        "Pre": [0, 0, 0, 1, 1],
        "Post": [1, 1, 1, 0, 0],
        "Austria": [6.18, 6.18, 6.17, 6.17, 6.40],
        "Belgium": [7.05, 7.05, 7.2, 7.5, 7.6],
        "France": [7.69, 7.61, 7.67, 7.91, 8.61],
    },
    index=[1, 2, 3, 4, 5],
)
df
index_data = df.melt(['Timestamp','Pre','Post'], var_name='Country Dummy', value_name='All_Countries')
index_data['Country Dummy'] = index_data['Country Dummy'].factorize()[0] + 1
# pd.Categorical(out['Country Dummy']).codes + 1
timestamp = pd.Categorical(index_data['Timestamp'])
index_data = index_data.set_index(['Timestamp', 'Country Dummy'])
index_data['Timestamp'] = timestamp
index_data
**What I do**
!pip install linearmodels
from linearmodels.panel import PooledOLS
import statsmodels.api as sm
exog_vars = ['Pre','Post']
exog = sm.add_constant(index_data[exog_vars])
mod = PooledOLS(index_data.All_Countries, exog)
pooled_res = mod.fit()
print(pooled_res)
**What I get**
"ValueError: exog does not have full column rank."
Question
Does anyone have an idea what could cause this problem?
Idea
Is it because my data should be formatted like this (see the example in the link at the top)? If yes, how could I get it into that shape?
Timestamp Country Dummy Pre Post All_Countries Timestamp
1993-11-01 1 0 1 6.18 1993-11-01
1993-11-02 0 1 6.18 1993-11-02
1993-11-03 0 1 6.17 1993-11-03
1993-11-04 1 0 6.17 1993-11-04
1993-11-15 1 0 6.40 1993-11-15
1993-11-01 2 0 1 7.05 1993-11-01
1993-11-02 0 1 7.05 1993-11-02
1993-11-03 0 1 7.20 1993-11-03
1993-11-04 1 0 7.50 1993-11-04
1993-11-15 1 0 7.60 1993-11-15
1993-11-01 3 0 1 7.69 1993-11-01
1993-11-02 0 1 7.61 1993-11-02
1993-11-03 0 1 7.67 1993-11-03
1993-11-04 1 0 7.91 1993-11-04
1993-11-15 1 0 8.61 1993-11-15
Upvotes: 2
Views: 4540
Reputation: 6132
That error is raised because `Pre` is an exact linear combination of `Post` and the constant. You should only use one of those columns because the other doesn't add information (and breaks the algebra behind your model). In this case:
Pre = 1 - Post
This is the same reason you drop one dummy, which then serves as the baseline, when running an OLS model.
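If you want to see the collinearity directly, here is a minimal sketch using only NumPy. It rebuilds the `Pre`/`Post` values from the question's five example dates; the names `X_both` and `X_post` are just illustrative. A design matrix has full column rank only when its rank equals its number of columns.

```python
import numpy as np

# Same Pre/Post pattern as the five dates in the question's example.
pre = np.array([0, 0, 0, 1, 1], dtype=float)
post = 1.0 - pre                 # Post is fully determined by Pre
const = np.ones_like(pre)        # the column sm.add_constant would add

# Constant + both dummies, as in the question.
X_both = np.column_stack([const, pre, post])
# Constant + Post only, i.e. after dropping one dummy.
X_post = np.column_stack([const, post])

rank_both = np.linalg.matrix_rank(X_both)
rank_post = np.linalg.matrix_rank(X_post)

# Full column rank means rank == number of columns.
print("both dummies:", rank_both, "of", X_both.shape[1], "columns")
print("Post only:   ", rank_post, "of", X_post.shape[1], "columns")
```

With both dummies the rank is lower than the column count, which is the condition the error message complains about; with only one dummy the two numbers match.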
This should work:
exog_vars = ['Post']
exog = sm.add_constant(index_data[exog_vars])
mod = PooledOLS(index_data.All_Countries, exog)
pooled_res = mod.fit()
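On the formatting question: this is not what raises the error, but as far as I understand the linearmodels docs, the panel models expect a MultiIndex with the entity as the outer level and the time as the inner level, and the time level should be numeric or datetime. The question's code puts `Timestamp` first and keeps it as strings. A sketch of the reshaping with pandas only, assuming the example frame from the question (the column name `Country` and the variable `panel` are my own choices):

```python
import pandas as pd

# Example data copied from the question.
df = pd.DataFrame({
    "Timestamp": ["1993-11-01", "1993-11-02", "1993-11-03", "1993-11-04", "1993-11-15"],
    "Pre": [0, 0, 0, 1, 1],
    "Post": [1, 1, 1, 0, 0],
    "Austria": [6.18, 6.18, 6.17, 6.17, 6.40],
    "Belgium": [7.05, 7.05, 7.2, 7.5, 7.6],
    "France": [7.69, 7.61, 7.67, 7.91, 8.61],
})

# Wide -> long: one row per (country, date).
panel = df.melt(["Timestamp", "Pre", "Post"],
                var_name="Country", value_name="All_Countries")

# Parse the dates, then index by entity first and time second.
panel["Timestamp"] = pd.to_datetime(panel["Timestamp"])
panel = panel.set_index(["Country", "Timestamp"]).sort_index()

print(panel.head())
```

When printed, pandas leaves repeated outer index labels blank, which gives the layout shown in the "Idea" part of the question. `panel.All_Countries` and `sm.add_constant(panel[['Post']])` could then be passed to `PooledOLS` as above.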
Upvotes: 3