confusedcoder

Reputation: 11

ValueError using sklearn and pandas for decision trees?

I'm new to scikit-learn. I read the documentation and a couple of other Stack Overflow posts on building a decision tree. I have a CSV data set with 16 attributes and 1 target label. How should I pass it to the decision tree classifier? My current code looks like this:

import pandas
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import tree

data = pandas.read_csv("yelp_atlanta_data_labelled.csv", sep=',')
vect = TfidfVectorizer()
X = vect.fit_transform(data) 
Y = data['go']

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)

When I run the code it gives me the following error:

ValueError: Number of labels=501 does not match number of samples=17

To give some context, my data set has 501 data points and 17 total columns. The go column is the target column with yes/no labels.

Upvotes: 1

Views: 1132

Answers (1)

David Maust

Reputation: 8270

The problem is that TfidfVectorizer cannot operate on a DataFrame directly; it expects a sequence of strings, one per document. Iterating over a DataFrame yields its column labels, so the vectorizer treats each of the 17 column names as a document and returns a matrix with 17 rows, while Y still has 501 labels. That mismatch is what the ValueError reports.
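To see where the wrong sample count comes from, here is a minimal sketch on a made-up three-row frame rather than your CSV. The `review` column name and the sample strings are assumptions for illustration only.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the CSV: 3 rows, 2 columns.
df = pd.DataFrame({
    "review": ["great food", "slow service", "great service"],
    "go": ["yes", "no", "yes"],
})

# Iterating over a DataFrame yields its column labels, not its rows.
docs_seen = list(df)

# Passing the whole frame: one "document" per column label.
X_wrong = TfidfVectorizer().fit_transform(df)

# Passing a single text column: one document per row.
X_right = TfidfVectorizer().fit_transform(df["review"])

print(docs_seen)
print(X_wrong.shape[0], "samples when passing the DataFrame")
print(X_right.shape[0], "samples when passing one column")
```

The row count of the first matrix equals the number of columns, which is the same pattern as 17 samples against 501 labels.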

Try instead using:

X = vect.fit_transform(data['my_column_name']) 

You may want to preprocess the dataframe to concatenate different columns prior to calling vect.fit_transform.
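One possible way to do that concatenation is sketched below. The question does not show the real column names, so `name`, `category`, and the sample rows are hypothetical; only `go` comes from your post. It also assumes the columns you join are text, since TF-IDF is only meaningful for text.

```python
import pandas as pd
from sklearn import tree
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical data standing in for the CSV.
data = pd.DataFrame({
    "name": ["Taco Hut", "Pizza Barn", "Sushi Spot", "Burger Den"],
    "category": ["mexican", "italian", "japanese", "american"],
    "go": ["yes", "no", "yes", "no"],
})

# Assumed list of text columns to feed the vectorizer.
text_columns = ["name", "category"]

# Join the chosen columns into a single string per row.
combined = data[text_columns].astype(str).agg(" ".join, axis=1)

vect = TfidfVectorizer()
X = vect.fit_transform(combined)  # one row per data point
Y = data["go"]

# X and Y now have the same number of rows, so fit() accepts them.
clf = tree.DecisionTreeClassifier(random_state=0)
clf = clf.fit(X, Y)
```

With your data, `X.shape[0]` and `len(Y)` should both be 501 after this step. Checking that before calling `fit` is a quick way to catch the same mismatch early.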

Upvotes: 1
