Reputation: 251
I have a dataset with different types of variables: binary, categorical, numerical, and textual.
   Text                                                Age   Type          Link             Start    Passed  Default
0  care packag saint luke cathol church wa ...         21.0  organisation  saintlukemclean  <2001.0  0       0
1  opportun busi group center food support compan...   23.0  organisation  cfanj            <2003.0  0       0
2  holiday ice rink persh squar depart cultur sit...   98.0  home          culturela        >1975.0  0       0
I have used different transformers, one for categorical (OneHotEncoder), one for numerical (SimpleImputer) and one for text variables (CountVectorizer/TF-IDF):
categorical_preprocessing = OneHotEncoder(handle_unknown='ignore')
# categorical_encoder = ('CV',CountVectorizer())
numeric_preprocessing = Pipeline([
    ('imputer', SimpleImputer(strategy='mean'))
])
# CountVectorizer
text_preprocessing_cv = Pipeline(steps=[
    ('CV', CountVectorizer())
])
# TF-IDF
text_preprocessing_tfidf = Pipeline(steps=[
    ('TF-IDF', TfidfVectorizer())
])
I use these to transform my features and pass them into pipelines (with Logistic Regression, Multinomial Naive Bayes, Random Forest and SVM classifiers) as follows:
preprocessing = ColumnTransformer(
    transformers=[
        ('text', text_preprocessing_cv, text_columns),
        ('category', categorical_preprocessing, categorical_columns),
        ('numeric', numeric_preprocessing, numerical_columns)
    ])
However, I have got an error at this step:
from sklearn.linear_model import LogisticRegression
clf = Pipeline(steps=[('preprocessor', preprocessing),
('classifier', LogisticRegression())])
clf.fit(X_train, y_train) # <-- error
ValueError: Selected columns, ['Age','Default'] are not unique in dataframe.
This error might be caused by my oversampling or by the way I preprocess the features. Resampling should be applied only to the training set to avoid overfitting, but it is not clear to me whether I need to account for the different variable types and their transformers before or after resampling.
I would appreciate help fixing the error so that a pipeline works with this preprocessing. Thanks
Please refer to the code:
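import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.utils import resample

text_columns = ['Text']
categorical_columns = ['Type', 'Link', 'Start']
numerical_columns = ['Age', 'Default'] # can I consider the boolean as numerical?
X = df[categorical_columns + numerical_columns + text_columns]
y = df['Passed']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)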
# Returning to one dataframe
training_set = pd.concat([X_train, y_train], axis=1) # need for re-sampling technique
passed=training_set[training_set['Passed']==1]
not_passed=training_set[training_set['Passed']==0]
# Oversampling the minority
oversample = resample(passed,
                      replace=True,
                      n_samples=len(not_passed))
# Returning to new training set
oversample_train = pd.concat([not_passed, oversample])
train_df = oversample_train.copy() # this train set is after applying the re-sampling
test_df = pd.concat([X_test, y_test], axis=1)
X_train = train_df.loc[:, train_df.columns != 'Passed']
y_train = train_df['Passed']
categorical_encoder = OneHotEncoder(handle_unknown='ignore')
numerical_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='mean'))
])
text_transformer_cv = Pipeline(steps=[
    ('cntvec', CountVectorizer())
])
# TF-IDF
text_preprocessing_tfidf = Pipeline(steps=[
    ('TF-IDF', TfidfVectorizer())
])
preprocessing = ColumnTransformer(
    transformers=[
        ('category', categorical_encoder, categorical_columns),
        ('numeric', numerical_pipe, numerical_columns), # I think this is causing the error. But I do not know why not also categorical columns
        ('text', text_transformer_cv, text_columns)
    ])
clf = Pipeline(steps=[('preprocessor', preprocessing),
('classifier', LogisticRegression())])
clf.fit(X_train, y_train)
Upvotes: 2
Views: 1387
Reputation: 15568
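The issue is the way the single text column is passed. CountVectorizer expects a 1-D sequence of strings, but a list selector such as ['Text'] makes ColumnTransformer hand it a 2-D DataFrame. I hope a future version of scikit-learn will allow ['Text'],
but until then pass the column name directly as a string: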
...
text_columns = 'Text' # instead of ['Text']
preprocessing = ColumnTransformer(
    transformers=[
        ('text', text_preprocessing_cv, text_columns),
        ('category', categorical_preprocessing, categorical_columns),
        ('numeric', numeric_preprocessing, numerical_columns)
    ],
    remainder='passthrough'
)
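To check the string selector in isolation, here is a minimal sketch on a toy frame. The values and the `build` helper are made up for illustration and are not from the question; only the column kinds mirror it. It fits a ColumnTransformer whose text selector is the plain string 'Text', so CountVectorizer receives a 1-D column of documents.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical values) with one text, one categorical
# and one numeric column, including a missing age to impute.
toy = pd.DataFrame({
    "Text": ["care packag church", "busi group food",
             "ice rink squar", "food support group"],
    "Type": ["organisation", "organisation", "home", "home"],
    "Age": [21.0, 23.0, None, 40.0],
})


def build(text_selector):
    """Hypothetical helper: build the transformer for a given text selector."""
    return ColumnTransformer(transformers=[
        # A scalar string selector passes a 1-D Series to the vectorizer;
        # a list such as ["Text"] would pass a 2-D DataFrame instead.
        ("text", CountVectorizer(), text_selector),
        ("category", OneHotEncoder(handle_unknown="ignore"), ["Type"]),
        ("numeric", SimpleImputer(strategy="mean"), ["Age"]),
    ])


features = build("Text").fit_transform(toy)
print(features.shape)  # one row per sample, one column per derived feature
```

The list form is still the right choice for the categorical and numeric transformers, since OneHotEncoder and SimpleImputer both expect 2-D input.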
Upvotes: 2