Reputation: 125
I have a DataFrame which can be simplified to this:
import pandas as pd

df = pd.DataFrame([
    {'title': 'batman',
     'text': 'man bat man bat',
     'url': 'batman.com',
     'label': 1},
    {'title': 'spiderman',
     'text': 'spiderman man spider',
     'url': 'spiderman.com',
     'label': 1},
    {'title': 'doctor evil',
     'text': 'a super evil doctor',
     'url': 'evilempyre.com',
     'label': 0},
])
And I want to try different feature extraction methods: TF-IDF, word2vec, CountVectorizer with different n-gram settings, etc. But I want to try them in different combinations: one feature set will contain 'text' data transformed with TF-IDF and 'url' with CountVectorizer, and a second will have the text data converted by word2vec and 'url' by TF-IDF, and so on. In the end, of course, I want to compare the different preprocessing strategies and choose the best one.
And here are the questions:
Is there a way to do such things using standard sklearn tools like Pipeline?
Does my idea make sense? Maybe there are good ways to treat text data spread across many DataFrame columns that I am missing?
Many thanks!
Upvotes: 5
Views: 6270
Reputation: 111
Take a look at the following link: http://scikit-learn.org/0.18/auto_examples/hetero_feature_union.html
from sklearn.base import BaseEstimator, TransformerMixin

class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]
The key value accepts a pandas DataFrame column label. When using it in your pipeline it can be applied as:
('tfidf_word', Pipeline([
    ('selector', ItemSelector(key='column_name')),
    ('tfidf', TfidfVectorizer()),
])),
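Following the linked example, two such selector pipelines can then be combined in a FeatureUnion and fed to a classifier. A minimal sketch against the question's DataFrame (the step names and the LogisticRegression at the end are just illustrative choices):

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression

union = FeatureUnion([
    # TF-IDF features from the 'text' column
    ('tfidf_text', Pipeline([
        ('selector', ItemSelector(key='text')),
        ('tfidf', TfidfVectorizer()),
    ])),
    # raw term counts from the 'url' column
    ('counts_url', Pipeline([
        ('selector', ItemSelector(key='url')),
        ('counts', CountVectorizer()),
    ])),
])

model = Pipeline([('union', union), ('clf', LogisticRegression())])
model.fit(df, df['label'])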
Upvotes: 4
Reputation: 16099
@elphz's answer is a good intro to how you could use FeatureUnion and FunctionTransformer to accomplish this, but I think it could use a little more detail.
First off, I would say you need to define your FunctionTransformer functions such that they can handle and return your input data properly. In this case I assume you just want to pass the DataFrame, but ensure that you get back a properly shaped array for use downstream. Therefore I would propose passing just the DataFrame and accessing by column name, like so:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def text(X):
    return X.text.values

def title(X):
    return X.title.values

pipe_text = Pipeline([('col_text', FunctionTransformer(text, validate=False))])
pipe_title = Pipeline([('col_title', FunctionTransformer(title, validate=False))])
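A quick sanity check with the question's DataFrame shows what this returns, a 1-D array of raw strings ready for a vectorizer:

pipe_text.fit_transform(df)
# array(['man bat man bat', 'spiderman man spider', 'a super evil doctor'],
#       dtype=object)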
Now, to test the variations of transformers and classifiers, I would propose using a list of transformers and a list of classifiers and simply iterating through them, much like a grid search.
from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.model_selection import train_test_split

tfidf = TfidfVectorizer()
cv = CountVectorizer()
lr = LogisticRegression()
rc = RidgeClassifier()

transformers = [('tfidf', tfidf), ('cv', cv)]
clfs = [lr, rc]

best_clf = None
best_score = 0
for tran1 in transformers:
    for tran2 in transformers:
        pipe1 = Pipeline(pipe_text.steps + [tran1])
        pipe2 = Pipeline(pipe_title.steps + [tran2])
        union = FeatureUnion([('text', pipe1), ('title', pipe2)])
        X = union.fit_transform(df)
        X_train, X_test, y_train, y_test = train_test_split(X, df.label)
        for clf in clfs:
            clf.fit(X_train, y_train)
            score = clf.score(X_test, y_test)
            if score > best_score:
                best_score = score
                best_clf = clf
This is a simple example, but you can see how you could plug in any variety of transformations and classifiers in this way.
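If you'd rather let scikit-learn drive that search than hand-roll the loops, the same comparison can be phrased as a GridSearchCV, since a named Pipeline step can itself be swapped out through the parameter grid. A sketch under that assumption (the step names 'vec' and 'clf' are labels I chose here):

from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ('union', FeatureUnion([
        ('text', Pipeline([('col', FunctionTransformer(text, validate=False)),
                           ('vec', TfidfVectorizer())])),
        ('title', Pipeline([('col', FunctionTransformer(title, validate=False)),
                            ('vec', TfidfVectorizer())])),
    ])),
    ('clf', LogisticRegression()),
])

# Whole steps can be grid parameters, so every vectorizer/classifier
# combination gets tried with proper cross-validation.
param_grid = {
    'union__text__vec': [TfidfVectorizer(), CountVectorizer()],
    'union__title__vec': [TfidfVectorizer(), CountVectorizer()],
    'clf': [LogisticRegression(), RidgeClassifier()],
}
search = GridSearchCV(pipe, param_grid)

# The 3-row toy frame above is too small to cross-validate; on real data:
# search.fit(df, df.label)
# print(search.best_params_, search.best_score_)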
Upvotes: 2
Reputation: 5718
I would use a combination of FunctionTransformer to select only certain columns, and then FeatureUnion to combine TF-IDF, word count, etc. features on each column. There may be a slightly cleaner way, but I think you'll end up with some sort of FeatureUnion and Pipeline nesting regardless.
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
def first_column(X):
    return X.iloc[:, 0]

def second_column(X):
    return X.iloc[:, 1]

# pipeline to get all tfidf and word count features for the first column
pipeline_one = Pipeline([
    ('column_selection', FunctionTransformer(first_column, validate=False)),
    ('feature-extractors', FeatureUnion([
        ('tfidf', TfidfVectorizer()),
        ('counts', CountVectorizer()),
    ])),
])

# then a second pipeline to do the same for the second column
pipeline_two = Pipeline([
    ('column_selection', FunctionTransformer(second_column, validate=False)),
    ('feature-extractors', FeatureUnion([
        ('tfidf', TfidfVectorizer()),
        ('counts', CountVectorizer()),
    ])),
])

# Then you would again feature-union these pipelines
# to get different feature extraction for each column
final_transformer = FeatureUnion([('first-column-features', pipeline_one),
                                  ('second-column-feature', pipeline_two)])

# Your dataframe has your target as the first column, so make sure to drop it first
y = df['label']
df = df.drop('label', axis=1)

# Now fit_transform should work
final_transformer.fit_transform(df)
If you don't want to apply multiple transformers to each column (TF-IDF and counts both likely won't be useful), then you could cut down on the nesting by one step, as sketched below.
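For instance, a minimal sketch of that flattened version, assuming you only want TF-IDF on each column:

# One vectorizer per column, no inner FeatureUnion needed
pipeline_one = Pipeline([
    ('column_selection', FunctionTransformer(first_column, validate=False)),
    ('tfidf', TfidfVectorizer()),
])
pipeline_two = Pipeline([
    ('column_selection', FunctionTransformer(second_column, validate=False)),
    ('tfidf', TfidfVectorizer()),
])
final_transformer = FeatureUnion([('first-column-features', pipeline_one),
                                  ('second-column-feature', pipeline_two)])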
Upvotes: 2