Reputation: 633
I'm using a Logistic Regression model on my data. From what I understand (e.g. from here: Pandas vs. Numpy Dataframes), it's better to use numpy.ndarray with sklearn than pandas DataFrames, which can be done via the .values attribute on the DataFrame. I have done this, but I get ValueError: Specifying the columns using strings is only supported for pandas DataFrames. Clearly, I am doing something wrong in my code. Any insights are much appreciated.
Funnily enough, my code works when I don't use .values, and just use X as a DataFrame and y as a Pandas Series.
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
# We will train our classifier with the following features:
# Numeric features to be scaled: LIMIT_BAL, AGE, PAY_X, BILL_AMTX, and PAY_AMTX
# Categorical features: SEX, EDUCATION, MARRIAGE
# We create the preprocessing pipelines for both numeric and categorical data
numeric_features = ['LIMIT_BAL', 'AGE', 'PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6',
                    'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6',
                    'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']
data['PAY_1'] = data.PAY_1.astype('float64')
data['PAY_2'] = data.PAY_2.astype('float64')
data['PAY_3'] = data.PAY_3.astype('float64')
data['PAY_4'] = data.PAY_4.astype('float64')
data['PAY_5'] = data.PAY_5.astype('float64')
data['PAY_6'] = data.PAY_6.astype('float64')
data['AGE'] = data.AGE.astype('float64')
numeric_transformer = Pipeline(steps=[
    ('scaler', MinMaxScaler())
])
categorical_features = ['SEX', 'EDUCATION', 'MARRIAGE']
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(categories='auto'))
])
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])
y = data['default'].values
X = data.drop('default', axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=10, stratify=y)
# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
lr = Pipeline(steps=[('preprocessor', preprocessor),
                     ('classifier', LogisticRegression(solver='liblinear'))])
param_grid_lr = {
    'classifier__C': np.logspace(-5, 8, 15)
}
lr_cv = GridSearchCV(lr, param_grid_lr, cv=10, iid=False)
lr_cv.fit(X_train, y_train)
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
Upvotes: 2
Views: 3666
Reputation: 22023
You are using ColumnTransformer as if you had a DataFrame, but you don't have one, since .values returns a NumPy array. From the ColumnTransformer documentation:
column(s) : string or int, array-like of string or int, slice, boolean mask array or callable
Indexes the data on its second axis. Integers are interpreted as positional columns, while strings can reference DataFrame columns by name. A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer. A callable is passed the input data X and can return any of the above.
If you pass strings for the columns, you need to pass a DataFrame. If you want to use a NumPy array instead, the astype conversions may not be required, and you need to specify the columns by integer position rather than by name.
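As a rough illustration of the integer-index route, here is a minimal sketch. The toy DataFrame `df` and its values are made up for the example (only a few of the question's column names are reused), and the helper names `numeric_idx` and `categorical_idx` are hypothetical. The idea is to look up each column's position while the names still exist, then hand those positions to ColumnTransformer and fit on the plain array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy stand-in for the asker's data (values are invented for illustration).
df = pd.DataFrame({
    "LIMIT_BAL": [20000.0, 120000.0, 90000.0, 50000.0],
    "AGE": [24.0, 26.0, 34.0, 37.0],
    "SEX": [2, 2, 1, 1],
    "MARRIAGE": [1, 2, 2, 1],
})

numeric_features = ["LIMIT_BAL", "AGE"]
categorical_features = ["SEX", "MARRIAGE"]

# Translate column names into integer positions before dropping to NumPy.
numeric_idx = [df.columns.get_loc(c) for c in numeric_features]
categorical_idx = [df.columns.get_loc(c) for c in categorical_features]

preprocessor = ColumnTransformer(transformers=[
    ("num", MinMaxScaler(), numeric_idx),
    ("cat", OneHotEncoder(categories="auto"), categorical_idx),
])

X = df.values  # plain ndarray: the column names are gone at this point
X_transformed = preprocessor.fit_transform(X)

# Two scaled numeric columns plus two one-hot columns per categorical feature.
print(X_transformed.shape)
```

The alternative, as the question already notes, is to skip .values and pass the DataFrame itself, in which case the string column names work unchanged.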
Upvotes: 2