Juned Ansari

Reputation: 5283

Feature Extraction for multiple text columns for classification problem

What is the correct way to extract features from multiple text columns and then apply a classification algorithm to them? Please let me know if I am going wrong anywhere.

example dataset


Independent Variables: Description1, Description2, State, NumericCol1, NumericCol2

Dependent Variable: TargetCategory

Code:

########### Feature Extraction for Text Data #####################
from sklearn.feature_extraction.text import TfidfVectorizer

######### Description1 (any text vectorization technique would do: CountVectorizer, TF-IDF, word2vec, BERT, etc.)
tfidf_desc1 = TfidfVectorizer(max_features = 500,
                              ngram_range = (1,3),
                              stop_words = "english")
X_Description1 = tfidf_desc1.fit_transform(df["Description1"].tolist())

######### Description2 (any text vectorization technique would do: CountVectorizer, TF-IDF, word2vec, BERT, etc.)
# Separate variable so the vectorizer fitted on Description1 is not overwritten
tfidf_desc2 = TfidfVectorizer(max_features = 500,
                              ngram_range = (1,3),
                              stop_words = "english")
X_Description2 = tfidf_desc2.fit_transform(df["Description2"].tolist())


######### State (has 100 unique entries, which is why BinaryEncoder is used)
import category_encoders as ce
binary_encoder= ce.BinaryEncoder(cols=['state'],return_df=True)
X_state = binary_encoder.fit_transform(df["state"])


import scipy.sparse  # import the submodule explicitly; a bare "import scipy" may not load it
X = scipy.sparse.hstack((X_Description1, 
                         X_Description2,
                         X_state,
                         df[["NumericCol1", "NumericCol2"]].to_numpy())).tocsr()

y = df['TargetCategory']


##### Train/Test Split ########
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=111)

##### Create Model ######
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score, classification_report, cohen_kappa_score
from sklearn import metrics

# Baseline random-forest model
# max_features = 'sqrt' is what 'auto' meant for classifiers; 'auto' was removed in scikit-learn 1.3
rfc = RandomForestClassifier(criterion = 'gini', n_estimators=1000, verbose=1, n_jobs = -1,
                             class_weight = 'balanced', max_features = 'sqrt')
rfcg = rfc.fit(X_train,y_train) # fit on training data


####### Prediction ##########
predictions = rfcg.predict(X_test)
print('Baseline: Accuracy: ', round(accuracy_score(y_test, predictions)*100, 2))
print('\n Classification Report:\n', classification_report(y_test,predictions))

Upvotes: 0

Views: 2372

Answers (1)

ygorg

Reputation: 770

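The way to use multiple columns as input in scikit-learn is with ColumnTransformer, which applies a separate transformer to each column (or group of columns) and concatenates the results into a single feature matrix.

The scikit-learn documentation includes an example of how to use it with heterogeneous data ("Column Transformer with Heterogeneous Data Sources").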
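
For the columns in the question, a minimal sketch could look like the one below. The toy DataFrame is made up for illustration, and OneHotEncoder stands in for the question's BinaryEncoder so that only scikit-learn and pandas are needed; the hyperparameters are placeholders, not recommendations.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data reusing the column names from the question.
df = pd.DataFrame({
    "Description1": ["red apple fruit", "green pear fruit",
                     "fast sports car", "slow family car"] * 5,
    "Description2": ["sweet and crisp", "soft and juicy",
                     "small coupe doors", "large sedan doors"] * 5,
    "state": ["NY", "CA", "TX", "NY"] * 5,
    "NumericCol1": [1.0, 2.0, 30.0, 40.0] * 5,
    "NumericCol2": [0.1, 0.2, 3.0, 4.0] * 5,
    "TargetCategory": ["food", "food", "vehicle", "vehicle"] * 5,
})

X = df.drop(columns="TargetCategory")
y = df["TargetCategory"]

preprocess = ColumnTransformer(
    transformers=[
        # Text vectorizers get a single column name as a string (not a list),
        # so each one receives a 1-D sequence of documents.
        ("desc1", TfidfVectorizer(max_features=500, ngram_range=(1, 3),
                                  stop_words="english"), "Description1"),
        ("desc2", TfidfVectorizer(max_features=500, ngram_range=(1, 3),
                                  stop_words="english"), "Description2"),
        # Assumption: one-hot encoding replaces the question's BinaryEncoder.
        ("state", OneHotEncoder(handle_unknown="ignore"), ["state"]),
        # Numeric columns are passed through unchanged.
        ("num", "passthrough", ["NumericCol1", "NumericCol2"]),
    ]
)

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", RandomForestClassifier(n_estimators=100, class_weight="balanced",
                                   random_state=111)),
])

# Split the raw DataFrame first; the pipeline fits the transformers afterwards.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=111, stratify=y
)

model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

Because the vectorizers and the encoder live inside the pipeline, they are fitted on the training split only and then reused on the test split, so the test set does not influence the vocabulary or IDF weights. Each text column also keeps its own fitted vectorizer, which can be reached through model.named_steps["preprocess"].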

Upvotes: 1
