Reputation: 189
I was recently reading about implementing a VIF (variance inflation factor) function in Python and came across this article.
I don't understand what operation takes place in this particular line:
features = "+".join(df.columns - ["annual_inc"])
I understand what the output would be if the statement were
features = "+".join(df.columns)
Can anyone explain the significance of - ["annual_inc"] in the statement?
Upvotes: 1
Views: 184
Reputation: 4872
The first argument of patsy.dmatrices is formula_like, which here is a string like y ~ x1 + x2. In features, you are creating a string of all column names (joined with a + in between) except your target variable, annual_inc. Next you have to create the input string for formula_like, i.e. target ~ variable1 + variable2 + ..., which in your case is 'annual_inc ~' + features:
dmatrices('annual_inc ~' + features, df, return_type='dataframe')
See the patsy.dmatrices documentation for details.
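As a rough illustration of the string building described above, here is a minimal sketch that uses a plain list in place of df.columns. The column names loan_amnt and int_rate are made up for the example; only annual_inc comes from the question.

```python
# Hypothetical column names; in the question these would come from df.columns.
columns = ["loan_amnt", "int_rate", "annual_inc"]
target = "annual_inc"

# Keep every column except the target, preserving the original order.
feature_names = [c for c in columns if c != target]
features = "+".join(feature_names)

# Formula string of the form "target ~ x1+x2", the shape passed to dmatrices.
formula = target + " ~ " + features
print(formula)  # annual_inc ~ loan_amnt+int_rate
```

The resulting string is what would be handed to dmatrices as its first argument, with the DataFrame as the second.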
Upvotes: 1
Reputation: 627
"annual_inc"
is the target variable of the regression, and therefore is excluded from the set of features.
Upvotes: 1
Reputation: 862681
I think this is old pandas code, in which - between an Index and a list performed a set difference. Current pandas versions raise an error instead:
df = pd.DataFrame(columns=['a','b','annual_inc'])
print (df.columns - ["annual_inc"])
TypeError: unsupported operand type(s) for -: 'str' and 'str'
So use Index.difference to exclude the values in a list from the column names:
print(df.columns.difference(["annual_inc"]))
Index(['a', 'b'], dtype='object')
features = "+".join(df.columns.difference(["annual_inc"]))
print(features)
a+b
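One thing to keep in mind: Index.difference returns the remaining labels sorted by default, so the feature order may differ from the column order. If the original order matters, Index.drop is one alternative. A small sketch, assuming a toy frame whose columns are deliberately out of alphabetical order:

```python
import pandas as pd

# Toy frame with columns deliberately not in alphabetical order.
df = pd.DataFrame(columns=["b", "annual_inc", "a"])

# difference() sorts the remaining labels by default.
sorted_features = "+".join(df.columns.difference(["annual_inc"]))

# drop() removes the label and keeps the original column order.
ordered_features = "+".join(df.columns.drop("annual_inc"))

print(sorted_features)   # a+b
print(ordered_features)  # b+a
```

Either string works as the right-hand side of the formula; the choice only affects the column order of the resulting design matrix.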
Upvotes: 2