Many questions on this topic have already been answered, but there is one thing I could not figure out.
I have a dataframe on which I run a regression, and the results are then stored in new columns of the Test dataframe. To compare methods, I need to run both linear and polynomial regression.
I have found a neat way to do this with linear regression, which leaves the predicted values in a new column of Test. But I cannot make this work within the same loop for polynomial regression: the final Test dataframe contains multiple null values, because at the polynomial_features.fit_transform(X) step the values somehow lose the corresponding Test index.
import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import PolynomialFeatures

Test = pd.read_csv(r'D:\myfile.csv')
df_coef = []
values = list(set(Test['Value']))
for value in values:
    # rows belonging to the current group
    df_redux = Test[Test['Value'] == value]
    Y = df_redux['Y']
    X = df_redux[['X1', 'A', 'B', 'G1']]
    X = sm.add_constant(X)
    # linear
    model_1 = sm.OLS(Y, X).fit()
    predictions_1 = model_1.predict(X)
    # polynomial
    polynomial_features = PolynomialFeatures(degree=2)
    xp = polynomial_features.fit_transform(X)
    model_2 = sm.OLS(Y, xp).fit()
    predictions_2 = model_2.predict(xp)
    # coefficient tables
    stats_1 = pd.read_html(model_1.summary().tables[1].as_html(), header=0, index_col=0)[0]
    stats_2 = pd.read_html(model_2.summary().tables[1].as_html(), header=0, index_col=0)[0]
    predictions_1 = pd.DataFrame(predictions_1, columns=['lin'])
    predictions_2 = pd.DataFrame(predictions_2, columns=['poly'])
    # ??? how to concat and append both predictions_1 and predictions_2 to the same df_coef list?
    gf = pd.concat([predictions_1, df_redux], axis=1)
    df_coef.append(gf)
all_coef = pd.concat(df_coef)

type(all_coef)
Out[234]: pandas.core.frame.DataFrame
The problem is that the transformed xp is of type <class 'numpy.ndarray'>, whereas X is of type <class 'pandas.core.frame.DataFrame'>. The question is: how can I get the polynomial regression predictions into a new column of Test, next to the linear regression results? This is probably really simple, but I could not figure it out.
print(type(X))
print(type(xp))
print(X.sample(2))
print()
print(xp)
<class 'pandas.core.frame.DataFrame'>
<class 'numpy.ndarray'>
X1 A B G1
962 4.334912 1.945910 3.135494 3.258097
1365 4.197888 2.197225 3.135494 3.332205
[[ 1. 4.77041663 1.94591015 ... 35.74106743 34.52550933
33.35129251]
[ 1. 4.43240629 1.94591015 ... 33.28387641 32.03140262
30.82605947]
[ 1. 3.21669428 1.94591015 ... 29.95821572 30.38903979
30.82605947]
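The missing values are consistent with how pandas aligns rows on index labels. Below is a minimal sketch with made-up numbers; the toy frame and the prediction values are assumptions for illustration, not the data from the question. It shows that wrapping a bare array in a DataFrame assigns a fresh 0, 1, 2, ... index, so concatenating along axis=1 with a filtered frame produces NaN wherever the labels do not match.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_redux (assumption): a filtered frame keeps its
# original, non-contiguous row labels.
df_redux = pd.DataFrame({"Y": [6.1, 6.0, 5.8]}, index=[0, 2, 4])

# Toy stand-in for model_2.predict(xp) (assumption): a plain array
# carries no index at all.
predictions_2 = np.array([6.15, 6.05, 5.83])

# Wrapping the array gives it a fresh RangeIndex 0, 1, 2 ...
misaligned = pd.DataFrame(predictions_2, columns=["poly"])
# ... so concat(axis=1) aligns on the union of labels and pads with NaN.
bad = pd.concat([misaligned, df_redux], axis=1)

# Reusing the index of df_redux keeps every prediction on its own row.
aligned = pd.DataFrame(predictions_2, columns=["poly"], index=df_redux.index)
good = pd.concat([aligned, df_redux], axis=1)

print(bad)
print(good)
```

In this sketch, bad has one row per label found in either frame, while good has exactly the rows of df_redux.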
The result I get when the polynomial regression predictions are appended to the original Test dataframe:
0 6.178542 3.0 692 ... 2.079442 4.783216 6.146329
1 6.156108 11.0 692 ... 2.197225 4.842126 6.113682
2 6.071453 12.0 692 ... 2.197225 4.814595 6.052089
3 5.842053 NaN NaN ... NaN NaN NaN
4 4.625762 30.0 692 ... 1.945910 5.018201 5.828946
This is the correct result I obtained using only linear regression, without NaN and with a value in every row, as it is supposed to be:
0 6.151675 3 692 5 ... 3.433987 2.079442 4.783216 6.146329
1 6.132077 11 692 5 ... 3.401197 2.197225 4.842126 6.113682
2 6.068450 12 692 5 ... 3.332205 2.197225 4.814595 6.052089
4 5.819535 30 692 5 ... 3.258097 1.945910 5.018201 5.828946
8 4.761362 61 692 5 ... 2.564949 1.945910 3.889585 4.624973
Upvotes: 0
I solved this by adding a line that converts the NumPy array of predictions into a pandas Series carrying the index of df_redux. The model statistics can be printed with the statsmodels summary:
import pandas as pd
from statsmodels.api import OLS
predictions_2 = model_2.predict(xp)
predictions_2_series = pd.Series(predictions_2, index=df_redux.index.values)
print(OLS(Y, xp).fit().summary())
Upvotes: 1