How to Retrieve Original Variables After Scikit Model Run w/OneHotEncoding

Question

I have successfully run a logistic regression model using scikit-learn's SGDClassifier, but I cannot easily interpret the model's coefficients (accessed via SGDClassifier.coef_) because the input data was transformed with scikit-learn's OneHotEncoder.

My original input data X is of shape (12000,11):

X = np.array([[1,4,3...9,4,1],
              [5,9,2...3,1,4],
              ...
              [7,8,1...6,7,8]
              ])

I then applied one hot encoding:

from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
X_OHE = enc.fit_transform(X).toarray()

which produces an array of shape (12000, 696):

X_OHE = np.array([[1,0,1...0,0,1],
                  [0,0,0...0,1,0],
                  ...
                  [1,0,1...0,0,1]
                  ])
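
For context on where the 696 columns come from: with its default settings, OneHotEncoder creates one output column per distinct value, grouped by original column in order, with the values sorted inside each group. Here is a minimal NumPy-only sketch of that layout. The array X_toy and its values are made up for illustration and are not my real data.

```python
import numpy as np

# Toy stand-in for X (hypothetical values, 3 columns instead of 11).
X_toy = np.array([[1, 4, 3],
                  [5, 9, 2],
                  [7, 4, 3],
                  [1, 8, 2]])

# Distinct values per original column; np.unique returns them sorted.
distinct = [np.unique(X_toy[:, j]) for j in range(X_toy.shape[1])]

# One encoded column per distinct value, so the encoded width is the sum.
n_encoded = sum(len(vals) for vals in distinct)

# Index of the first encoded column belonging to each original column.
offsets = np.cumsum([0] + [len(vals) for vals in distinct])[:-1]

print(distinct)
print(n_encoded)
print(offsets)
```

If the encoder was fitted with default settings, the same sum over the 11 columns of the real X should come out to 696.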

I then access the model's coefficients with SGDClassifier.coef_, which returns an array of shape (1, 696):

coefs = np.array([[-1.233e+00,0.9123e+00,-2.431e+00...-0.238e+01,-1.33e+00,0.001e-01]])

How do I map the coefficient values back to the original values in X, so I can say something like, "if variable foo has a value of bar, the target variable increases/decreases by bar_coeff"?
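
To make the goal concrete, here is a rough sketch of the kind of mapping I am after. It assumes a scikit-learn version (0.20 or later) in which the fitted encoder exposes a categories_ attribute, one array of values per original column. The helper label_coefficients, the toy data, the column names "foo" and "bar", and the placeholder coefficients are all hypothetical; in practice the coefficients would come from SGDClassifier.coef_.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder


def label_coefficients(categories, coefs, feature_names=None):
    """Pair each one-hot coefficient with (original column, original value).

    Hypothetical helper: `categories` is a list with one array of values per
    original column, in the same order as the encoded columns.
    """
    coefs = np.ravel(coefs)  # flatten the (1, n_columns) coefficient array
    if feature_names is None:
        feature_names = [f"x{j}" for j in range(len(categories))]
    labels = [(name, value)
              for name, values in zip(feature_names, categories)
              for value in values]
    if len(labels) != len(coefs):
        raise ValueError("coefficient count does not match encoded columns")
    return [(name, value, coef) for (name, value), coef in zip(labels, coefs)]


# Toy data standing in for X (made-up values, 2 columns).
X_toy = np.array([[1, 4], [5, 9], [7, 4], [1, 8]])
enc = OneHotEncoder()
X_ohe = enc.fit_transform(X_toy).toarray()

# Placeholder for SGDClassifier.coef_ (shape (1, n_encoded_columns)),
# not the output of a fitted model.
fake_coefs = np.arange(X_ohe.shape[1], dtype=float).reshape(1, -1)

rows = label_coefficients(enc.categories_, fake_coefs, ["foo", "bar"])
for name, value, coef in rows:
    print(f"{name} == {value}: {coef:+.3f}")
```

The sketch relies on the encoded columns being ordered by original column and then by value, which matches the order of enc.categories_ when the encoder is used on its own as above.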

Let me know if you need more info on the data or the model parameters. Thank you.

I found one unanswered question about this on SO: How to retrieve coefficient names after label encoding and one hot encoding on scikit-learn?
