Reputation: 23
I have a dataset of around 400 rows with several categorical columns and a free-text description column as the inputs to my classification model. I am planning to use an SVM as the classifier. Since the model cannot accept non-numeric input, I am converting the input features to numeric data.
I have applied TF-IDF to the description column, which converts the terms into a matrix.
Do I need to convert the categorical features using label encoding and then merge them with the TF-IDF matrix before feeding everything into the machine learning model?
Upvotes: 2
Views: 2068
Reputation: 15568
Use ColumnTransformer to apply a different transformation pipeline to each group of columns, depending on its data type. Here is an example:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer  # needed for the categorical pipeline below
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC
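# pipeline for text data
# note: pass the column name as a plain string (not a list) so that
# TfidfVectorizer receives a 1-D sequence of documents
text_features = 'text_column'
text_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])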
# pipeline for categorical data
categorical_features = ['cat_col1', 'cat_col2']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# you can add other transformations for other data types
# combine preprocessing with ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('text', text_transformer, text_features),
        ('cat', categorical_transformer, categorical_features)
    ])
# add the model as the final step of the pipeline
clf_pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', SVC())
])
# ...
## you can use the preprocessor by itself
# X_train = preprocessor.fit_transform(X_train)
# X_test = preprocessor.transform(X_test)
# clf_s = SVC().fit(X_train, y_train)
# clf_s.score(X_test, y_test)
## or, better, fit and score the whole pipeline in one go
# clf_pipe.fit(X_train, y_train)
# clf_pipe.score(X_test, y_test)
See Scikit-learn Example for more details
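To see what the combined feature matrix looks like, here is a minimal sketch on a made-up four-row DataFrame. The column names (`description`, `colour`, `label`) and the data are hypothetical, and `SVC(kernel="linear")` is only one possible choice of model. It is meant to show that ColumnTransformer places the TF-IDF columns and the one-hot columns side by side, so no manual merging is needed.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

# Hypothetical toy data; the column names are made up for illustration.
df = pd.DataFrame({
    "description": [
        "red apple fresh fruit",
        "green apple sour fruit",
        "steel hammer heavy tool",
        "wooden hammer light tool",
    ],
    "colour": ["red", "green", "grey", "brown"],
    "label": ["food", "food", "tool", "tool"],
})
features = df[["description", "colour"]]

preprocessor = ColumnTransformer(transformers=[
    # a plain string selects a 1-D column, which TfidfVectorizer expects
    ("text", TfidfVectorizer(), "description"),
    # a list selects a 2-D frame, which OneHotEncoder expects
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["colour"]),
])

# The output has one column per TF-IDF term plus one per category level.
X = preprocessor.fit_transform(features)
n_terms = len(preprocessor.named_transformers_["text"].vocabulary_)
n_categories = len(preprocessor.named_transformers_["cat"].categories_[0])
print(X.shape, n_terms, n_categories)

# The same preprocessor can be chained with the classifier.
clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", SVC(kernel="linear")),
])
clf.fit(features, df["label"])
predictions = clf.predict(features)
```

One-hot encoding is used here instead of label encoding because label encoding turns nominal categories into arbitrary integers, and a distance-based model such as an SVM would treat those integers as an ordering.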
Upvotes: 4