Reputation: 870
I am currently trying to predict whether a Kickstarter project will be successful or not, based on a number of integer features and some text features. I was looking at building a pipeline that would look something like the following.
Here is my ItemSelector and pipeline code:
from sklearn.base import BaseEstimator, TransformerMixin

class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, keys):
        self.keys = keys

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.keys]
I verified that the ItemSelector is working as expected:
t = ItemSelector(['cleaned_text'])
t.transform(df)
and it extracts the necessary columns.
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import SVC

pipeline = Pipeline([
    # Use FeatureUnion to combine the text and integer features
    ('union', FeatureUnion(
        transformer_list=[
            # Pipeline for vectorizing the cleaned text
            ('text', Pipeline([
                ('selector', ItemSelector(['cleaned_text'])),
                ('counts', CountVectorizer()),
                ('tf_idf', TfidfTransformer())
            ])),
            # Selector for the integer features
            ('integer_features', ItemSelector(int_features)),
        ]
    )),
    # Use an SVC classifier on the combined features
    ('svc', SVC(kernel='linear')),
])
But when I run pipeline.fit(X_train, y_train) I receive this error. Any idea how to fix this?
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-27-317e1c402966> in <module>()
----> 1 pipeline.fit(X_train, y_train)
~/Anaconda/anaconda/envs/ds/lib/python3.5/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
266 This estimator
267 """
--> 268 Xt, fit_params = self._fit(X, y, **fit_params)
269 if self._final_estimator is not None:
270 self._final_estimator.fit(Xt, y, **fit_params)
~/Anaconda/anaconda/envs/ds/lib/python3.5/site-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params)
232 pass
233 elif hasattr(transform, "fit_transform"):
--> 234 Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
235 else:
236 Xt = transform.fit(Xt, y, **fit_params_steps[name]) \
~/Anaconda/anaconda/envs/ds/lib/python3.5/site-packages/sklearn/pipeline.py in fit_transform(self, X, y, **fit_params)
740 self._update_transformer_list(transformers)
741 if any(sparse.issparse(f) for f in Xs):
--> 742 Xs = sparse.hstack(Xs).tocsr()
743 else:
744 Xs = np.hstack(Xs)
~/Anaconda/anaconda/envs/ds/lib/python3.5/site-packages/scipy/sparse/construct.py in hstack(blocks, format, dtype)
456
457 """
--> 458 return bmat([blocks], format=format, dtype=dtype)
459
460
~/Anaconda/anaconda/envs/ds/lib/python3.5/site-packages/scipy/sparse/construct.py in bmat(blocks, format, dtype)
577 exp=brow_lengths[i],
578 got=A.shape[0]))
--> 579 raise ValueError(msg)
580
581 if bcol_lengths[j] == 0:
ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 81096, expected 1.
Upvotes: 1
Views: 1736
Reputation: 36619
The ItemSelector is returning a DataFrame, not an array. That's why scipy's sparse hstack is throwing the error. Change the ItemSelector as below:
class ItemSelector(BaseEstimator, TransformerMixin):
    ....
    ....
    ....
    def transform(self, data_dict):
        return data_dict[self.keys].as_matrix()
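Side note: .as_matrix() was deprecated in later pandas releases and removed in pandas 1.0. If you are on a recent pandas (0.24 or newer), a minimal sketch of the same selector using .to_numpy() instead would be:

from sklearn.base import BaseEstimator, TransformerMixin

class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, keys):
        self.keys = keys

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        # .to_numpy() replaces the removed .as_matrix()
        return data_dict[self.keys].to_numpy()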
The error occurs in the integer_features part of your pipeline. In the first part, text, the transformers below the ItemSelector accept the DataFrame and convert it to an array correctly. But the second part has only the ItemSelector and therefore returns a DataFrame.
Update:
In the comments you mentioned that you want to perform some additional operations on the DataFrame returned by the ItemSelector. So instead of modifying the transform method of the ItemSelector, you can make a new transformer and append it to the second part of your pipeline:
class DataFrameToArrayTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, x, y=None):
        return self

    def transform(self, X):
        return X.as_matrix()
Then your pipeline should look like this:
pipeline = Pipeline([
    # Use FeatureUnion to combine the text and integer features
    ('union', FeatureUnion(
        transformer_list=[
            # Pipeline for vectorizing the cleaned text
            ('text', Pipeline([
                ('selector', ItemSelector(['cleaned_text'])),
                ('counts', CountVectorizer()),
                ('tf_idf', TfidfTransformer())
            ])),
            # Pipeline for selecting the integer features and
            # converting the resulting DataFrame to an array
            ('integer', Pipeline([
                ('integer_features', ItemSelector(int_features)),
                ('array', DataFrameToArrayTransformer()),
            ])),
        ]
    )),
    # Use an SVC classifier on the combined features
    ('svc', SVC(kernel='linear')),
])
The main thing to understand here is that FeatureUnion only handles 2-D arrays (or sparse matrices) when combining them, so any other type, like a DataFrame, may present a problem there.
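If you want to double-check this, a quick debugging sketch (it assumes X_train, y_train, and the fixed pipeline above are already defined) is to fit each FeatureUnion branch on its own and inspect the type and shape of its output; every branch should produce a 2-D array or sparse matrix with one row per sample:

union = pipeline.named_steps['union']
for name, transformer in union.transformer_list:
    out = transformer.fit_transform(X_train, y_train)
    # Each branch should return a 2-D numpy array or scipy sparse matrix
    # with the same number of rows as X_train.
    print(name, type(out), out.shape)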
Upvotes: 1