SGDClassifier generates KeyError from sparse dataset

Question

I'm performing some text analysis, pulling in the data using Pandas.

import json
import pandas as pd

X = pd.read_csv('../data/training.tsv', sep='\t', na_values=['?'])
X['json'] = X['json'].apply(json.loads)
extractBody = lambda x: x['body'] if 'body' in x and x['body'] is not None else u'empty'
X_all['body'] = X['json'].map(extractBody)

I feed this into a scikit-learn vectorizer, keeping the tf-idf weighting as a separate step:

body_counter = CountVectorizer()
body_counts = body_counter.fit_transform(X_all['body'])
body_transform = TfidfTransformer()
body_counts = body_transform.fit_transform(body_counts)
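
For reference, here is a minimal sketch of what this two-step setup computes, using a made-up three-document corpus (docs is hypothetical, not my data). With default parameters, CountVectorizer followed by TfidfTransformer should produce the same sparse matrix as a single TfidfVectorizer:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Hypothetical toy corpus, standing in for the 'body' column.
docs = [u"buy cheap pills now", u"meeting notes for monday", u"cheap pills cheap"]

# Two steps: raw term counts first, then tf-idf weighting.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One step: TfidfVectorizer combines both with the same defaults.
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(two_step.shape, same)
```

Either way the result is a scipy sparse matrix with one row per document, which is what gets passed to the classifier.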

I want to use SGDClassifier for a simple binary classification, roughly "spam" vs. "non-spam".

model = SGDClassifier(n_iter=5, loss='log')
model.fit(body_counts, labels)

When this runs, the fit method raises the following KeyError:

...
return self.index.get_value(self,key)
...
return self._engine.get_value(series, key)

File "index.pyx", line 96, in pandas.index.IndexEngine.get_value (pandas/index.c:2873)
File "index.pyx", line 104, in pandas.index.IndexEngine.get_value (pandas/index.c:2685)
File "index.pyx", line 148, in pandas.index.IndexEngine.get_loc (pandas/index.c:3422)
File "hashtable.pyx", line 382, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:6570)
File "hashtable.pyx", line 388, in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:6511)
KeyError: 0

I'm not sure what's going on here. The model works fine when I only cross-validate it (cross_val_score), and I can run this dataset through naive_bayes or TruncatedSVD in scikit-learn. The error only happens when I call fit on this model directly, and I'm not sure why.

How do I fix this? Or am I looking at a bug in scikit-learn?

edit

Yes, unfortunately I had to retype my code into this post instead of copying it, so there are probably some mistakes. I'm coding this on a laptop without a Wi-Fi connection.

X.shape = (7396, 105273)
len(labels) = 7395
type(labels) = pandas.core.series.Series

...I converted labels to a numpy array, and it went through!
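
One plausible explanation, offered as a sketch and not a confirmed diagnosis: if labels is a Series whose integer index does not contain 0 (for example after dropping or filtering rows), then labels[0] is treated as an index label, not a position, and raises KeyError: 0. The Series and index below are made up for illustration; converting to a NumPy array drops the index, so positional access works:

```python
import numpy as np
import pandas as pd

# Hypothetical labels whose integer index does not contain 0,
# e.g. what is left after dropping the first row of a DataFrame.
labels = pd.Series([0, 1, 1, 0], index=[1, 2, 3, 4])

try:
    labels[0]  # label-based lookup on an integer index
    raised_key_error = False
except KeyError:
    raised_key_error = True

# A plain array has no index, so position 0 is simply the first element.
labels_array = np.asarray(labels)
first_label = labels_array[0]

print(raised_key_error, first_label)
```

Passing labels_array (or labels.values) to model.fit sidesteps the pandas index lookup, which is consistent with the conversion making the error go away.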

It still baffles me that cross_val_score would accept the labels as is, but model.fit would not.

Thanks!
