Reputation: 11
I'm new to scikit-learn and have only read the documentation and a couple of other Stack Overflow posts on building a decision tree. I have a CSV data set with 16 attributes and 1 target label. How should I pass it to the decision tree classifier? My current code looks like this:
import pandas
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import tree
data = pandas.read_csv("yelp_atlanta_data_labelled.csv", sep=',')
vect = TfidfVectorizer()
X = vect.fit_transform(data)
Y = data['go']
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
When I run the code it gives me the following error:
ValueError: Number of labels=501 does not match number of samples=17
To give some context, my data set has 501 data points and 17 columns in total. The go column is the target column with yes/no labels.
Upvotes: 1
Views: 1132
Reputation: 8270
The problem is that TfidfVectorizer cannot operate on a dataframe directly. It can only operate on a sequence of strings. Because you are passing a dataframe, it iterates over it, which yields the 17 column names rather than the 501 rows, so you end up with 17 samples while Y still has 501 labels.
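A quick way to see the mismatch is the minimal sketch below. It uses a small made-up dataframe; the column names review and city and all the values are hypothetical stand-ins, not taken from your CSV.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the real CSV: 4 rows, 3 columns.
df = pd.DataFrame({
    "review": ["great food", "slow service", "nice patio", "cold coffee"],
    "city": ["atlanta", "atlanta", "decatur", "atlanta"],
    "go": ["yes", "no", "yes", "no"],
})

# Iterating over a DataFrame yields its column labels, not its rows.
docs_seen = list(df)

# So the vectorizer treats each column name as one "document".
X_wrong = TfidfVectorizer().fit_transform(df)

# Passing a single text column gives one row per data point.
X_right = TfidfVectorizer().fit_transform(df["review"])

print(docs_seen, X_wrong.shape, X_right.shape)
```

Here X_wrong has one row per column of the dataframe, which mirrors the "samples=17" in your error, while X_right has one row per data point.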
Try instead using:
X = vect.fit_transform(data['my_column_name'])
You may want to preprocess the dataframe to concatenate different columns into a single string per row before calling vect.fit_transform.
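One way the concatenation could look is sketched below. It assumes every non-target column can reasonably be treated as text, which may not hold for your 16 attributes. The dataframe contents and the column names review and category are made up for illustration; only go comes from your question.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data; in practice this would come from pandas.read_csv(...).
data = pd.DataFrame({
    "review": ["great food", "slow service", "nice patio", "cold coffee"],
    "category": ["restaurant", "restaurant", "bar", "cafe"],
    "go": ["yes", "no", "yes", "no"],
})

# Assumption: every column except the target is text-like.
feature_cols = [c for c in data.columns if c != "go"]

# Join the feature columns into one string per row.
text = data[feature_cols].astype(str).agg(" ".join, axis=1)

vect = TfidfVectorizer()
X = vect.fit_transform(text)  # one row per data point
Y = data["go"]

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, Y)
```

With this, X has the same number of rows as Y, so the fit no longer raises the ValueError. If most of your attributes are numeric or categorical rather than free text, TF-IDF is probably not the right encoding for them, and this sketch would need to be adapted.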
Upvotes: 1