Reputation: 745
I am trying to perform classification in Python using Pandas and scikit-learn. My dataset contains a mix of text variables, numerical variables and categorical variables.
Let's say my dataset looks like this:
Project Cost  Project Category  Project Description    Project Outcome
12392.2       ABC               This is a description  Fully Funded
493992.4      DEF               Stack Overflow rocks   Expired
And I need to predict the variable Project Outcome. Here is what I did (assuming df contains my dataset):
I converted the categories Project Category and Project Outcome to numeric values:
df['Project Category'] = df['Project Category'].factorize()[0]
df['Project Outcome'] = df['Project Outcome'].factorize()[0]
Dataset now looks like this:
Project Cost  Project Category  Project Description    Project Outcome
12392.2       0                 This is a description  0
493992.4      1                 Stack Overflow rocks   1
Then I processed the text column using TF-IDF
tfidf_vectorizer = TfidfVectorizer()
df['Project Description'] = tfidf_vectorizer.fit_transform(df['Project Description'])
Dataset now looks something like this:
Project Cost  Project Category  Project Description                            Project Outcome
12392.2       0                 (0, 249)\t0.17070240732941433\n (0, 304)\t0..  0
493992.4      1                 (0, 249)\t0.17070240732941433\n (0, 304)\t0..  1
So since all variables are now numerical values, I thought I would be good to go to start training my model
X = df.drop(columns=['Project Outcome'], axis=1)
y = df['Project Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
model = MultinomialNB()
model.fit(X_train, y_train)
But I get the error "ValueError: setting an array element with a sequence." when calling model.fit. When I print X_train, I notice that Project Description was replaced by NaN for some reason.
Any help on this? Is there a good way to do classification using variables with various data types? Thank you.
Upvotes: 5
Views: 3943
Reputation: 4162
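Replace this
df['Project Description'] = tfidf_vectorizer.fit_transform(df['Project Description'])
with
X_tfidf = tfidf_vectorizer.fit_transform(df['Project Description']).toarray()
The result has one column per vocabulary term, so keep it as its own array (or DataFrame) rather than assigning it back to the single 'Project Description' column.
You can also use tfidf_vectorizer.fit_transform(df['Project Description']).todense(), which returns a numpy matrix instead of an array.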
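A minimal sketch of the dense conversion, assuming the TF-IDF output is expanded into one DataFrame column per vocabulary term and joined back onto the remaining columns. The toy values and the 'tfidf_' column prefix are illustrative assumptions, not part of the original dataset.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data modelled on the question's table (values are illustrative).
df = pd.DataFrame({
    'Project Cost': [12392.2, 493992.4],
    'Project Category': [0, 1],
    'Project Description': ['This is a description', 'Stack Overflow rocks'],
})

tfidf_vectorizer = TfidfVectorizer()
# Dense array of shape (n_rows, n_terms).
tfidf = tfidf_vectorizer.fit_transform(df['Project Description']).toarray()

# Order the terms by their column index in the fitted vocabulary.
vocab = tfidf_vectorizer.vocabulary_
terms = sorted(vocab, key=vocab.get)
tfidf_df = pd.DataFrame(tfidf,
                        columns=['tfidf_' + t for t in terms],  # hypothetical prefix
                        index=df.index)

# Swap the raw text column for the numeric TF-IDF columns.
X = pd.concat([df.drop(columns=['Project Description']), tfidf_df], axis=1)
print(X.shape)
```

Every column of X is numeric at this point, so it can be passed to train_test_split and model.fit. A dense array grows quickly with vocabulary size, so this is only practical for small corpora.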
Also, you should not simply convert categorical features to integers. For example, if you convert A, B and C to 0, 1 and 2, the model treats them as ordered (2 > 1 > 0, hence C > B > A), which is usually not the case, since A is just different from B and C. For this you can use one-hot encoding (in pandas, 'get_dummies'). You can use the code below for all your categorical features.
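import pandas as pd

# data is the original DataFrame; df holds only the non-categorical features
featurelist_categorical = ['Project Category', 'Feature A', 'Feature B']
prefixes = ['Project Category', 'A', 'B']
for col, prefix in zip(featurelist_categorical, prefixes):
    # append one indicator column per category value
    df = pd.concat([df, pd.get_dummies(data[col], prefix=prefix)], axis=1)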
The prefix is not required, but it helps you tell the generated columns apart, especially when there are multiple categorical features.
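As a small illustration of what get_dummies produces, here is a sketch on a three-row toy column (the values are made up to mirror the question's categories):

```python
import pandas as pd

# Toy categorical column; 'ABC' and 'DEF' mirror the question's example values.
data = pd.DataFrame({'Project Category': ['ABC', 'DEF', 'ABC']})

# One indicator column per distinct category, named "<prefix>_<value>".
dummies = pd.get_dummies(data['Project Category'], prefix='Project Category')
print(list(dummies.columns))
# ['Project Category_ABC', 'Project Category_DEF']
```

Each row has exactly one active indicator, so no ordering between the categories is implied.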
Also, if for some reason you don't want to encode your categorical features as numbers, you can use H2O.ai. With H2O you can feed categorical variables directly into models as text.
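Staying within scikit-learn, the one-hot and TF-IDF steps can also be combined in a single preprocessing object. The sketch below is one possible arrangement, not something the original answer prescribes: it assumes ColumnTransformer with OneHotEncoder for the category, TfidfVectorizer for the text, and MinMaxScaler for the cost (chosen because MultinomialNB expects non-negative features). The four toy rows are made up.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Illustrative data in the shape of the question's table.
df = pd.DataFrame({
    'Project Cost': [12392.2, 493992.4, 5000.0, 250000.0],
    'Project Category': ['ABC', 'DEF', 'ABC', 'DEF'],
    'Project Description': ['This is a description', 'Stack Overflow rocks',
                            'Another description', 'Overflow of rocks'],
    'Project Outcome': ['Fully Funded', 'Expired', 'Fully Funded', 'Expired'],
})
X = df.drop(columns=['Project Outcome'])
y = df['Project Outcome']

preprocess = ColumnTransformer([
    # Scale the cost to [0, 1] so it stays non-negative.
    ('cost', MinMaxScaler(), ['Project Cost']),
    # One indicator column per category value.
    ('category', OneHotEncoder(handle_unknown='ignore'), ['Project Category']),
    # TfidfVectorizer needs a 1-D column, so pass the name as a string, not a list.
    ('text', TfidfVectorizer(), 'Project Description'),
])

model = make_pipeline(preprocess, MultinomialNB())
model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

Because the vectorizer and encoder live inside the pipeline, they are fitted only on whatever is passed to fit, which keeps test data out of the vocabulary when used with train_test_split.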
Upvotes: 2
Reputation: 1306
The problem arises in Step 2 with tfidf_vectorizer.fit_transform(df['Project Description']), because tfidf_vectorizer.fit_transform returns a sparse matrix, which is then stored in a squashed form in the df['Project Description'] column. You want to keep the result as a sparse (or, less ideally, a dense) matrix for model training and testing. Here's example code for preparing the data in dense form:
import pandas as pd
import numpy as np
df = pd.DataFrame({'project_category': [1,2,1],
'project_description': ['This is a description','Stackoverflow rocks', 'Another description']})
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(df['project_description']).toarray()
# stack the category column (reshaped to n_rows x 1) in front of the TF-IDF columns
X_all_data_tfidf = np.hstack((df['project_category'].values.reshape(-1, 1), X_tfidf))
The last line adds 'project_category' as an extra column, in case you want to include it as a feature in your model.
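For the sparse variant mentioned above, a minimal sketch assuming scipy.sparse.hstack is used in place of np.hstack (the variable names X_tfidf_sparse and X_all_sparse are illustrative):

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({'project_category': [1, 2, 1],
                   'project_description': ['This is a description',
                                           'Stackoverflow rocks',
                                           'Another description']})

tfidf_vectorizer = TfidfVectorizer()
# Keep the TF-IDF output sparse: no .toarray() call.
X_tfidf_sparse = tfidf_vectorizer.fit_transform(df['project_description'])

# Wrap the category column as an (n_rows, 1) sparse matrix, then stack column-wise.
category = csr_matrix(df[['project_category']].to_numpy(dtype=float))
X_all_sparse = hstack([category, X_tfidf_sparse]).tocsr()
print(X_all_sparse.shape)
```

scikit-learn estimators such as MultinomialNB accept this sparse matrix directly, which avoids materialising the zero entries for large vocabularies.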
Upvotes: 1