Machine Learning Classification using categorical and text data as input

Question

I have a dataset of around 400 rows with several categorical columns and a free-text description column as input to my classification model. I plan to use an SVM as the classifier. Since the model cannot accept non-numeric input, I am converting the input features to numeric data.

I have applied TF-IDF to the description column, which converted the terms into a matrix.

Do I need to convert the categorical features using label encoding and then merge them with the TF-IDF matrix before feeding everything into the machine learning model?

Prayson W. Daniel · Accepted Answer

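Use ColumnTransformer to apply a different transformation pipeline to each group of columns, depending on its data type. For nominal categories, one-hot encoding is generally a better fit for an SVM than label encoding, because label encoding imposes an arbitrary numeric order on the categories. Here is an example: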

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC


# pipeline for text data
# (pass the column name as a string, not a list: TfidfVectorizer expects 1-D input)
text_features = 'text_column'
text_transformer = Pipeline(steps=[
    ('vectorizer', TfidfVectorizer(stop_words="english"))
])

# pipeline for categorical data
categorical_features = ['cat_col1', 'cat_col2',]
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# you can add other transformations for other data types

# combine preprocessing with ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('text', text_transformer, text_features),
        ('cat', categorical_transformer, categorical_features)
])

# add the model as the final step of the pipeline
clf_pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', SVC())
])

# ...

## you can use the preprocessor by itself
# X_train = preprocessor.fit_transform(X_train)
# X_test = preprocessor.transform(X_test)
# clf_s = SVC().fit(X_train, y_train)
# clf_s.score(X_test, y_test)

## or, better, use the whole pipeline
# clf_pipe.fit(X_train, y_train)
# clf_pipe.score(X_test, y_test)

See Scikit-learn Example for more details
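
To see what the combined feature matrix looks like, here is a minimal, self-contained sketch of the same idea on a tiny made-up DataFrame. The data, the column values, and the `label` column are assumptions for illustration only; the column names `text_column`, `cat_col1`, and `cat_col2` are the placeholders used above. It fits the preprocessor on its own to inspect the output shape, then fits the full pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

# Hypothetical toy data: values and labels are invented for this sketch.
df = pd.DataFrame({
    "text_column": [
        "red apple fresh fruit",
        "green apple sour fruit",
        "fast sports car engine",
        "slow family car wagon",
        "ripe banana sweet fruit",
        "electric car battery motor",
    ],
    "cat_col1": ["food", "food", "vehicle", "vehicle", "food", np.nan],
    "cat_col2": ["small", "small", "large", "large", "small", "large"],
    "label": [0, 0, 1, 1, 0, 1],
})
X, y = df.drop(columns="label"), df["label"]

# Categorical columns: fill missing values, then one-hot encode.
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# The text column is given as a plain string so the vectorizer receives 1-D input.
preprocessor = ColumnTransformer(transformers=[
    ("text", TfidfVectorizer(stop_words="english"), "text_column"),
    ("cat", categorical_transformer, ["cat_col1", "cat_col2"]),
])

# The TF-IDF columns and the one-hot columns are stacked side by side.
features = preprocessor.fit_transform(X)
n_rows, n_cols = features.shape
print(f"combined feature matrix: {n_rows} rows x {n_cols} columns")

# Full pipeline: preprocessing and the SVM are fitted together.
clf_pipe = Pipeline(steps=[("preprocessor", preprocessor), ("model", SVC())])
clf_pipe.fit(X, y)
predictions = clf_pipe.predict(X)
```

With real data you would split into training and test sets first and call `clf_pipe.fit` on the training part only, so that the TF-IDF vocabulary and the one-hot categories are learned without looking at the test rows.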
