Reputation: 2543
I want to do a binary classification based on different features I have (both text and numerical). The training data is in the form of a pandas DataFrame. My pipeline looks something like this:
final_pipeline = Pipeline([
    ('union', FeatureUnion(
        transformer_list=[
            ('body_trans', Pipeline([('selector', ItemSelector(key='body')),
                                     ('count_vect', CountVectorizer())])),
            ('body_trans2', Pipeline([('selector', ItemSelector(key='body2')),
                                      ('count_vect', TfidfVectorizer())])),
            ('length_trans', Pipeline([('selector', ItemSelector(key='length')),
                                       ('min_max_scaler', MinMaxScaler())]))],
        transformer_weights={'body_trans': 1.0, 'body_trans2': 1.0, 'length_trans': 1.0})),
    ('svc', SVC())])
ItemSelector looks like this:
class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_frame):
        return data_frame[[self.key]]
Now, when I try final_pipeline.fit(X_train, y_train), it raises the exception ValueError: blocks[0,:] has incompatible row dimensions.
X_train, X_test, y_train, y_test = train_test_split(train_set, target_set)
is how I get my training data. train_set is a DataFrame with the fields body, body2, length, etc. target_set is a DataFrame with only one field, label, which is the actual label to classify.
Edit:
I think my input data to the pipeline is not in the proper format.
train_set is my training data with the features. Sample:
             body  length             body2
0       blah-blah     193       blah-blah-2
1  blah-blah-blah     153  blah-blah-blah-2
and target_set, which is the DataFrame with the classifying label:
   label
0   True
1  False
If there is any tutorial on input format for a Pipeline's fitting parameters using DataFrames, please provide me with a link! I can't find proper documentation as to how to load DataFrames as input for Pipelines while using multiple columns as separate features.
Any help is appreciated!
Upvotes: 1
Views: 1912
Reputation: 36619
The issue is in your ItemSelector. It outputs a 2-d DataFrame, but CountVectorizer and TfidfVectorizer need a 1-d iterable of strings.
The following code shows the output of ItemSelector:
import numpy as np
from pandas import DataFrame
df = DataFrame(columns=['body', 'length', 'body2'],
               data=np.array([['blah-blah', 193, 'blah-blah-2'],
                              ['blah-blah-2', 153, 'blah-blah-blah-2']]))
body_selector = ItemSelector(key='body')
df_body = body_selector.fit_transform(df)
df_body.shape
# (2,1)
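To see why that (2, 1) shape leads to the row-dimension error, here is a small sketch (the toy DataFrame is made up for illustration). Iterating over a DataFrame yields its column labels rather than its rows, so the vectorizer appears to treat the single column name as the only document and returns one row, while the other branches of the FeatureUnion return one row per sample:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy data, invented for this illustration only.
df = pd.DataFrame({"body": ["blah-blah", "blah-blah-blah"],
                   "length": [193, 153]})

two_d = df[["body"]]   # DataFrame of shape (2, 1), what ItemSelector returns
one_d = df["body"]     # Series of shape (2,), what the vectorizers expect

# Iterating a DataFrame gives column labels, not rows.
docs_seen = list(two_d)

bad = CountVectorizer().fit_transform(two_d)   # one "document": the label 'body'
good = CountVectorizer().fit_transform(one_d)  # one document per row

print(docs_seen)
print(bad.shape[0], "row(s) from the 2-d input")
print(good.shape[0], "row(s) from the 1-d input")
```

With more than one training sample, the text branch (one row) and the length branch (one row per sample) can no longer be stacked side by side, which matches the reported ValueError.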
You can define another transformer that ravels (flattens) the selected column into the 1-d form the vectorizers expect.
Add this class to your code:
class Converter(BaseEstimator, TransformerMixin):
    def fit(self, x, y=None):
        return self

    def transform(self, data_frame):
        return data_frame.values.ravel()
Then define your pipeline like this:
final_pipeline = Pipeline([
    ('union', FeatureUnion(
        transformer_list=[
            ('body_trans', Pipeline([('selector', ItemSelector(key='body')),
                                     ('converter', Converter()),
                                     ('count_vect', CountVectorizer())])),
            ('body_trans2', Pipeline([('selector', ItemSelector(key='body2')),
                                      ('converter', Converter()),
                                      ('count_vect', TfidfVectorizer())])),
            ('length_trans', Pipeline([('selector', ItemSelector(key='length')),
                                       ('min_max_scaler', MinMaxScaler())]))],
        transformer_weights={'body_trans': 1.0, 'body_trans2': 1.0, 'length_trans': 1.0})),
    ('svc', SVC())])
There is no need to add the Converter to the third branch (length_trans), because MinMaxScaler requires 2-d input, which ItemSelector already provides.
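For reference, here is a self-contained sketch that puts the pieces together on a tiny invented DataFrame. The sample rows and the `text_branch` helper are assumptions made for illustration, and passing the label as a 1-d Series (`target_set["label"]`) instead of a one-column DataFrame is my own suggestion, not part of the question:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC


class ItemSelector(BaseEstimator, TransformerMixin):
    """Select one column and keep it 2-d (n_samples, 1)."""
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_frame):
        return data_frame[[self.key]]


class Converter(BaseEstimator, TransformerMixin):
    """Flatten the (n_samples, 1) frame to a 1-d array for the vectorizers."""
    def fit(self, x, y=None):
        return self

    def transform(self, data_frame):
        return data_frame.values.ravel()


def text_branch(key, vectorizer):
    # Hypothetical helper: select a column, flatten it, then vectorize it.
    return Pipeline([('selector', ItemSelector(key=key)),
                     ('converter', Converter()),
                     ('vect', vectorizer)])


# Toy data, invented for this sketch.
train_set = pd.DataFrame({
    'body': ['good product', 'bad service', 'good service', 'bad product'],
    'body2': ['would buy again', 'never again', 'quite happy', 'very unhappy'],
    'length': [193, 153, 170, 120],
})
target_set = pd.DataFrame({'label': [True, False, True, False]})

final_pipeline = Pipeline([
    ('union', FeatureUnion(transformer_list=[
        ('body_trans', text_branch('body', CountVectorizer())),
        ('body_trans2', text_branch('body2', TfidfVectorizer())),
        # The numeric branch stays 2-d, as MinMaxScaler requires.
        ('length_trans', Pipeline([('selector', ItemSelector(key='length')),
                                   ('min_max_scaler', MinMaxScaler())])),
    ])),
    ('svc', SVC()),
])

final_pipeline.fit(train_set, target_set['label'])
features = final_pipeline.named_steps['union'].transform(train_set)
predictions = final_pipeline.predict(train_set)
print(features.shape[0], 'rows in the combined feature matrix')
```

Every branch now returns one row per sample, so the FeatureUnion can stack them horizontally.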
Feel free to ask if you run into any problems.
Upvotes: 7