Reputation: 185
I have a DataFrame with 14 columns. I am using custom transformers to select the columns I need and process them by data type before combining them in a FeatureUnion.
My custom ColumnSelector transformer is:
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        try:
            return X[self.columns]
        except KeyError:
            cols_error = list(set(self.columns) - set(X.columns))
            raise KeyError("The DataFrame does not include the columns: %s" % cols_error)
This is followed by a custom TypeSelector:
class TypeSelector(BaseEstimator, TransformerMixin):
    def __init__(self, dtype):
        self.dtype = dtype

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        return X.select_dtypes(include=[self.dtype])
The original DataFrame, from which I select the desired columns, is df_with_types and has 981 rows. The columns I wish to extract are listed below along with their respective data types:

meeting_subject_stem_sentence: 'object'
priority_label_stem_sentence: 'object'
attendees: 'category'
day_of_week: 'category'
meeting_time_mins: 'int64'
I then construct my pipeline as follows:
preprocess_pipeline = make_pipeline(
    ColumnSelector(columns=['meeting_subject_stem_sentence', 'attendees', 'day_of_week',
                            'meeting_time_mins', 'priority_label_stem_sentence']),
    FeatureUnion(transformer_list=[
        ("integer_features", make_pipeline(
            TypeSelector('int64'),
            StandardScaler()
        )),
        ("categorical_features", make_pipeline(
            TypeSelector("category"),
            OneHotEnc()
        )),
        ("text_features", make_pipeline(
            TypeSelector("object"),
            TfidfVectorizer(stop_words=stopWords)
        ))
    ])
)
The error thrown when I fit the pipeline to data is:
preprocess_pipeline.fit_transform(df_with_types)
ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,2].shape[0] == 2, expected 981.
I have a hunch this is happening because of the TfidfVectorizer. To check, I fit a pipeline containing only the TfidfVectorizer, without the FeatureUnion:
the_pipe = Pipeline([
    ('col_sel', ColumnSelector(columns=['meeting_subject_stem_sentence', 'attendees', 'day_of_week',
                                        'meeting_time_mins', 'priority_label_stem_sentence'])),
    ('type_selector', TypeSelector('object')),
    ('tfidf', TfidfVectorizer())
])
When I fit the_pipe:
a = the_pipe.fit_transform(df_with_types)
This gives me a 2x2 matrix instead of one with 981 rows:
(0, 0) 1.0
(1, 1) 1.0
Calling get_feature_names() on the tfidf step via named_steps, I get
the_pipe.named_steps['tfidf'].get_feature_names()
[u'meeting_subject_stem_sentence', u'priority_label_stem_sentence']
It seems to be fitting only on the column names and not iterating through the documents. How do I get the vectorizer to iterate over the documents in a Pipeline like the one above? Also, if I wanted to apply a pairwise distance/similarity function to each feature as part of the pipeline, after ColumnSelector and TypeSelector, what must I do?
An example would be...
preprocess_pipeline = make_pipeline(
    ColumnSelector(columns=['meeting_subject_stem_sentence', 'attendees', 'day_of_week',
                            'meeting_time_mins', 'priority_label_stem_sentence']),
    FeatureUnion(transformer_list=[
        ("integer_features", make_pipeline(
            TypeSelector('int64'),
            StandardScaler()
            # pairwise Manhattan distance between each element of the integer feature
        )),
        ("categorical_features", make_pipeline(
            TypeSelector("category"),
            OneHotEnc()
            # pairwise Dice coefficient here
        )),
        ("text_features", make_pipeline(
            TypeSelector("object"),
            TfidfVectorizer(stop_words=stopWords)
            # pairwise cosine similarity here
        ))
    ])
)
Please help. Being a beginner, I have been racking my brain over this to no avail. I have gone through zac_stewart's blog and many other similar ones, but none seem to talk about how to use TfidfVectorizer with TypeSelector or ColumnSelector. Thank you so much for all the help. I hope I formulated the question clearly.
EDIT 1:
If I use a TextSelector transformer, like the following...
class TextSelector(BaseEstimator, TransformerMixin):
    """Transformer that selects a text column from the DataFrame by key."""
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # the key passed here is the column name
        return X[self.key]
text_processing_pipe_line_1 = Pipeline([
    ('selector', TextSelector(key='meeting_subject')),
    ('text_1', TfidfVectorizer(stop_words=stopWords))
])
t = text_processing_pipe_line_1.fit_transform(df_with_types)
(0, 656) 0.378616399898
(0, 75) 0.378616399898
(0, 117) 0.519159384271
(0, 545) 0.512337545421
(0, 223) 0.425773433566
(1, 154) 0.5
(1, 137) 0.5
(1, 23) 0.5
(1, 355) 0.5
(2, 656) 0.497937369182
This works and it is iterating through the documents, so if I could make TypeSelector return a Series, that would work, right? Thanks again for the help.
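For reference, something like this (untested, just to illustrate what I mean by making the selector return a Series) is what I have in mind:

class SingleTypeSelector(BaseEstimator, TransformerMixin):
    """Like TypeSelector, but returns a Series when exactly one column matches the dtype."""
    def __init__(self, dtype):
        self.dtype = dtype

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        selected = X.select_dtypes(include=[self.dtype])
        # If only one column matches, return it as a 1-D Series of documents
        if selected.shape[1] == 1:
            return selected.iloc[:, 0]
        return selected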
Upvotes: 0
Views: 1322
Reputation: 4150
Question 1
You have 2 columns that hold text. Either apply a TfidfVectorizer separately on each of them and then combine the results with a FeatureUnion, or concatenate the strings into one column and treat that concatenation as a single document.

I suspect this is the root of your problem: TfidfVectorizer.fit() takes raw_documents, which must be an iterable that yields str. In your case you pass a DataFrame with 2 text columns, and iterating over a DataFrame yields its column names, which is why the vectorizer fits on only those 2 strings (exactly the feature names you saw). Read the official docs for more info.
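A minimal sketch of the first option, assuming a small per-column selector that returns a single column as a Series so each TfidfVectorizer receives an iterable of strings (the class and step names here are illustrative, not from your code):

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer

class SingleTextSelector(BaseEstimator, TransformerMixin):
    """Return one text column as a 1-D Series, i.e. an iterable of raw string documents."""
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key]

# One TfidfVectorizer per text column, stacked side by side with FeatureUnion
text_features = FeatureUnion(transformer_list=[
    ('subject_tfidf', Pipeline([
        ('select', SingleTextSelector('meeting_subject_stem_sentence')),
        ('tfidf', TfidfVectorizer()),
    ])),
    ('priority_tfidf', Pipeline([
        ('select', SingleTextSelector('priority_label_stem_sentence')),
        ('tfidf', TfidfVectorizer()),
    ])),
])

This text_features union could replace the "text_features" branch of your preprocess_pipeline. For the second option, you could build the concatenated column before the pipeline, e.g. df['all_text'] = df['meeting_subject_stem_sentence'] + ' ' + df['priority_label_stem_sentence'], and run a single TfidfVectorizer on it.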
Question 2
You cannot use a pairwise similarity/distance as part of the pipeline because it is not a transformer. Transformers transform each sample independently of the others, whereas a pairwise metric needs 2 samples at the same time. However, you can simply compute it after you fit_transform the pipeline, via sklearn.metrics.pairwise.pairwise_distances.
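A minimal sketch of that post-pipeline step, assuming preprocess_pipeline and df_with_types from your question (the metric choices are only illustrations):

from sklearn.metrics.pairwise import pairwise_distances, cosine_similarity

# Transform first, then compute sample-by-sample metrics outside the pipeline
X = preprocess_pipeline.fit_transform(df_with_types)    # shape: (n_samples, n_features)

manhattan = pairwise_distances(X, metric='manhattan')   # (n_samples, n_samples) distance matrix
cosine_sim = cosine_similarity(X)                       # works directly on sparse output

If you want a different metric per feature block (Manhattan for the integers, Dice for the one-hot categoricals, cosine for the tf-idf part), fit_transform each sub-pipeline separately and compute the metric on that block alone.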
Upvotes: 1