Python Developer

Reputation: 633

Using numpy.ndarray vs. Pandas Dataframe in sklearn's .fit() method

I'm fitting a Logistic Regression model to my data. From what I understand (e.g. from here: Pandas vs. Numpy Dataframes), it's better to pass a numpy.ndarray to sklearn than a pandas DataFrame, and this can be done with the .values attribute of the DataFrame. I have done this, but I get ValueError: Specifying the columns using strings is only supported for pandas DataFrames. Clearly, I am doing something wrong in my code. Any insights are much appreciated.

Funnily enough, my code works when I don't use .values, and just use X as a DataFrame and y as a Pandas Series.

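import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# `data` is assumed to be a pandas DataFrame that has already been loaded.
# We will train our classifier with the following features: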
# Numeric features to be scaled: LIMIT_BAL, AGE, PAY_X, BIL_AMTX, and PAY_AMTX
# Categorical features: SEX, EDUCATION, MARRIAGE

# We create the preprocessing pipelines for both numeric and categorical data
numeric_features = ['LIMIT_BAL', 'AGE', 'PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 
                 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 
                 'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']

data['PAY_1'] = data.PAY_1.astype('float64')
data['PAY_2'] = data.PAY_2.astype('float64')
data['PAY_3'] = data.PAY_3.astype('float64')
data['PAY_4'] = data.PAY_4.astype('float64')
data['PAY_5'] = data.PAY_5.astype('float64')
data['PAY_6'] = data.PAY_6.astype('float64')
data['AGE'] = data.AGE.astype('float64')


numeric_transformer = Pipeline(steps=[
('scaler', MinMaxScaler())
])

categorical_features = ['SEX', 'EDUCATION', 'MARRIAGE']
categorical_transformer = Pipeline(steps=[
('onehot', OneHotEncoder(categories='auto'))
])

preprocessor = ColumnTransformer(
transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

y = data['default'].values
X = data.drop('default', axis=1).values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
random_state=10, stratify=y)

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
lr = Pipeline(steps=[('preprocessor', preprocessor),
                 ('classifier', LogisticRegression(solver='liblinear'))])

param_grid_lr = {
'classifier__C': np.logspace(-5, 8, 15)
}

lr_cv = GridSearchCV(lr, param_grid_lr, cv=10, iid=False)

lr_cv.fit(X_train, y_train)

ValueError: Specifying the columns using strings is only supported for pandas DataFrames

Upvotes: 2

Views: 3666

Answers (1)

Matthieu Brucher

Reputation: 22023

You are using ColumnTransformer as if you had a dataframe, but you don't have one...

column(s) : string or int, array-like of string or int, slice, boolean mask array or callable

Indexes the data on its second axis. Integers are interpreted as positional columns, while strings can reference DataFrame columns by name. A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer. A callable is passed the input data X and can return any of the above.

If you pass strings for the columns, you need to pass a DataFrame. If you want to use a NumPy array instead, then first, the type casts (the astype calls) may not be required, and second, you need to specify the columns by integer position, not by name.
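To make the two options concrete, here is a minimal sketch. The toy `data` frame, its three feature columns, and the `make_pipeline` helper are assumptions made for illustration and only stand in for the question's real dataset; the integer positions are derived from the DataFrame's column order before it is converted with .values.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy data standing in for the question's `data` frame.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "LIMIT_BAL": rng.uniform(1e4, 5e5, size=40),
    "AGE": rng.uniform(21, 70, size=40),
    "SEX": np.tile([1, 2, 2, 1], 10),
    "default": np.tile([0, 1], 20),
})
numeric_features = ["LIMIT_BAL", "AGE"]
categorical_features = ["SEX"]


def make_pipeline(num_cols, cat_cols):
    """Build the preprocessing + classifier pipeline for the given column specs."""
    preprocessor = ColumnTransformer(transformers=[
        ("num", MinMaxScaler(), num_cols),
        ("cat", OneHotEncoder(categories="auto"), cat_cols),
    ])
    return Pipeline(steps=[
        ("preprocessor", preprocessor),
        ("classifier", LogisticRegression(solver="liblinear")),
    ])


X_df = data.drop("default", axis=1)
y = data["default"].values

# Option 1: keep X as a DataFrame, so columns can be selected by name.
by_name = make_pipeline(numeric_features, categorical_features).fit(X_df, y)

# Option 2: pass a NumPy array and select columns by integer position.
num_idx = [X_df.columns.get_loc(c) for c in numeric_features]
cat_idx = [X_df.columns.get_loc(c) for c in categorical_features]
by_position = make_pipeline(num_idx, cat_idx).fit(X_df.values, y)

proba_by_name = by_name.predict_proba(X_df)
proba_by_position = by_position.predict_proba(X_df.values)
```

Both pipelines see the same values, so they should give matching predictions; the only difference is how the ColumnTransformer is told which columns to pick. Mixing the two (string names with X_df.values) is what triggers the ValueError from the question.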

Upvotes: 2
