Python scikit-learn classification with mixed data types (text, numerical, categorical)

I am trying to perform classification in Python using Pandas and scikit-learn. My dataset contains a mix of text variables, numerical variables and categorical variables.

Let's say my dataset looks like this:

Project Cost        Project Category        Project Description       Project Outcome
12392.2             ABC                     This is a description     Fully Funded
493992.4            DEF                     Stack Overflow rocks      Expired

And I need to predict the variable Project Outcome. Here is what I did (assuming df contains my dataset):

I converted the categories Project Category and Project Outcome to numeric values

df['Project Category'] = df['Project Category'].factorize()[0]
df['Project Outcome'] = df['Project Outcome'].factorize()[0]

Dataset now looks like this:

Project Cost        Project Category        Project Description       Project Outcome
12392.2             0                       This is a description     0
493992.4            1                       Stack Overflow rocks      1

Then I processed the text column using TF-IDF

tfidf_vectorizer = TfidfVectorizer()
df['Project Description'] = tfidf_vectorizer.fit_transform(df['Project Description'])

Dataset now looks something like this:

Project Cost        Project Category        Project Description       Project Outcome
12392.2             0                       (0, 249)\t0.17070240732941433\n (0, 304)\t0..     0
493992.4            1                       (0, 249)\t0.17070240732941433\n (0, 304)\t0..     1

Since all the variables were now numerical, I thought I was ready to start training my model:

X = df.drop(columns=['Project Outcome'], axis=1)
y = df['Project Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
model = MultinomialNB()
model.fit(X_train, y_train)

But I get the error ValueError: setting an array element with a sequence. when calling model.fit. When I print X_train, I see that Project Description has been replaced by NaN for some reason.

Any help on this? Is there a good way to do classification using variables with various data types? Thank you.

Upvotes: 5

Answers (2)

amalik2205

Reputation: 4182

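Replace this

df['Project Description'] = tfidf_vectorizer.fit_transform(df['Project Description'])

with

X_text = tfidf_vectorizer.fit_transform(df['Project Description']).toarray()

Keep X_text as a separate array rather than assigning it back to a single DataFrame column: the result has one column per vocabulary term, so it does not fit into one column.

You can also use: tfidf_vectorizer.fit_transform(df['Project Description']).todense(), which returns a numpy.matrix instead of a plain ndarray.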

Also, you should not simply convert categories to numbers. For example, if you convert A, B and C to 0, 1 and 2, the model reads them as 2 > 1 > 0 and hence C > B > A, which is usually not the case, since A is merely different from B and C. For this you can use one-hot encoding (in Pandas you can use 'get_dummies'). You can use the code below for all your categorical features.

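import pandas as pd

# df holds all the non-categorical features; data is assumed to be the
# original DataFrame that still contains the categorical columns
# ('Feature A' and 'Feature B' are placeholder column names)
featurelist_categorical = ['Project Category', 'Feature A',
           'Feature B']

# append the one-hot (dummy) columns for each categorical feature, with a prefix
for i,j in zip(featurelist_categorical, ['Project Category','A','B']):
  df = pd.concat([df, pd.get_dummies(data[i],prefix=j)], axis=1)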

The prefix is not required, but it helps keep the generated columns apart, especially when you have several categorical features.
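To see the difference between integer codes and one-hot columns on a small scale, here is a minimal sketch. The toy frame and the extra category GHI are made up for illustration; only the column name comes from the question.

```python
import pandas as pd

# Toy frame: the column name mirrors the question, the values are made up.
toy = pd.DataFrame({"Project Category": ["ABC", "DEF", "ABC", "GHI"]})

# factorize() assigns integer codes in order of first appearance;
# a model may read these codes as an ordering that does not exist.
codes = toy["Project Category"].factorize()[0]

# get_dummies() creates one indicator column per category instead.
dummies = pd.get_dummies(toy["Project Category"], prefix="Project Category")

print(codes.tolist())
print(dummies.columns.tolist())
```

Each row of dummies has exactly one active indicator, so no category is treated as larger or smaller than another.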

Also, if for some reason you don't want to convert your categorical features into numbers, you can use H2O.ai. With H2O you can feed categorical variables directly into models as text.
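If you would rather stay within scikit-learn, one possible approach (not something the question or this answer used) is to let a ColumnTransformer apply a different encoder to each column type and feed the combined result to the classifier. The sketch below is only an illustration: the sample rows are made up, and the choice of MinMaxScaler for the cost column and handle_unknown="ignore" for the encoder are assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Made-up rows in the shape of the question's dataset.
df = pd.DataFrame({
    "Project Cost": [12392.2, 493992.4, 8000.0, 250000.0],
    "Project Category": ["ABC", "DEF", "ABC", "DEF"],
    "Project Description": [
        "This is a description",
        "Stack Overflow rocks",
        "Another short description",
        "Overflow of rocks",
    ],
    "Project Outcome": ["Fully Funded", "Expired", "Fully Funded", "Expired"],
})

preprocess = ColumnTransformer([
    # A bare string selects a 1-D column, which TfidfVectorizer expects.
    ("text", TfidfVectorizer(), "Project Description"),
    # A list selects a 2-D frame, which the other transformers expect.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Project Category"]),
    # MultinomialNB needs non-negative inputs, hence the scaling to [0, 1]
    # (an assumption about how to treat the cost column).
    ("num", MinMaxScaler(), ["Project Cost"]),
])

model = make_pipeline(preprocess, MultinomialNB())

X = df.drop(columns=["Project Outcome"])
y = df["Project Outcome"]
model.fit(X, y)

predictions = model.predict(X)
print(predictions)
```

Because the encoders live inside the pipeline, they are fitted only on whatever data is passed to fit, which keeps a train/test split free of leakage from the test rows.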

Upvotes: 2

kevins_1

Reputation: 1306

The problem arises in the TF-IDF step, with tfidf_vectorizer.fit_transform(df['Project Description']): fit_transform returns a sparse matrix with one row per document and one column per term, and assigning it to df['Project Description'] tries to squash it into a single column. Instead, keep the result as a separate sparse matrix (or, less ideally, a dense one) for model training and testing. Here's example code that prepares the data in dense form:

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({'project_category': [1,2,1],
                   'project_description': ['This is a description','Stackoverflow rocks', 'Another description']})

# vectorize the text and convert the sparse result to a dense array
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(df['project_description']).toarray()
# reshape the category into a single column and stack it in front of the TF-IDF columns
X_all_data_tfidf = np.hstack((df['project_category'].values.reshape(len(df['project_category']),1), X_tfidf))

The last line prepends 'project_category' as an extra column, in case you want to include it as a feature in your model.
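For the sparse route mentioned at the start of this answer, a minimal sketch might look like the following. It assumes SciPy's sparse hstack is an acceptable way to join the columns; the variable names category and X_all are hypothetical.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({
    "project_category": [1, 2, 1],
    "project_description": [
        "This is a description",
        "Stackoverflow rocks",
        "Another description",
    ],
})

# fit_transform returns a SciPy sparse matrix: one row per document.
X_tfidf = TfidfVectorizer().fit_transform(df["project_description"])

# Wrap the category column as an (n_rows, 1) sparse matrix so it can be stacked.
category = csr_matrix(df[["project_category"]].to_numpy(dtype=float))

# Stack column-wise without ever building the dense TF-IDF array.
X_all = hstack([category, X_tfidf]).tocsr()
print(X_all.shape)
```

Scikit-learn estimators such as MultinomialNB accept a sparse X_all directly, which matters once the vocabulary grows to thousands of terms.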

Upvotes: 1
