Reputation: 427
I am trying to use the FeatureUnion functionality of scikit-learn Pipelines on a project where the data is in a database. I am having some foundational issues in how to structure what I'm doing.
I am creating two features from two different tables in the database. I have fetch_x1 and fetch_x2 methods to grab the data of interest from the database tables as pandas DataFrames. I pack the two DataFrames into a dictionary of DataFrames. In each transformer, I unpack the DataFrame of interest and operate on it. I'm roughly following the pattern of this post.
My code is below:
class Feature1Extractor(TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, dictionary_of_dataframes):
        df = dictionary_of_dataframes['feature1_raw_data']
        x = df.groupby('user_id').count()['x1']
        return x

class Feature2Extractor(TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, dictionary_of_dataframes):
        df = dictionary_of_dataframes['feature2']
        x = df.groupby('user_id').sum()['x2']
        return x
pipeline = Pipeline([
    ('union', FeatureUnion(
        transformer_list=[
            ('feature1', Feature1Extractor()),
            ('feature2', Feature2Extractor())])),
    ('null', None)
])

pipeline.transform(dictionary_of_dataframes)
However, I'm running into a more foundational issue: after transformation, the two feature matrices that come out of the two transformers have a different number of rows. Consequently, the simple hstack at the end of FeatureUnion fails like so:
ValueError: all the input array dimensions except for the concatenation axis must match exactly
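The mismatch can be reproduced on toy stand-ins for the two tables (a minimal sketch; the values and user_ids here are illustrative, not the real data):

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the two tables: user 1 only appears in the first
# table, users 3 and 4 only in the second.
df1 = pd.DataFrame({'user_id': [1, 1, 2], 'x1': ['click', 'click', 'click']})
df2 = pd.DataFrame({'user_id': [2, 3, 4], 'x2': [12.3, 14.5, 9.9]})

f1 = df1.groupby('user_id')['x1'].count()  # 2 rows: users 1, 2
f2 = df2.groupby('user_id')['x2'].sum()    # 3 rows: users 2, 3, 4

# FeatureUnion stacks the transformer outputs positionally, so differing
# row counts (or even equal counts over different users) break the hstack:
try:
    np.hstack([f1.values.reshape(-1, 1), f2.values.reshape(-1, 1)])
except ValueError as e:
    print(e)
```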
This is fundamental to my data. There are a number of user_ids that are not present in the feature1 table; similarly, there are a number of user_ids that are not present in the feature2 table. If a user has no data in the feature1 table, he/she never used that feature in the app, i.e. no data = no engagement with that feature. To make this explicit, here is an example of the two df's that are being passed to each transformer:
df (for feature1)
user_id, x1, timestamp
1, 'click', 1/1/2016
1, 'click', 1/2/2016
2, 'click', 1/2/2016
df (for feature2)
user_id, x2, timestamp
2, 12.3, 1/2/2016
3, 14.5, 1/4/2016
Note how the DataFrame for feature1 does not have user 3, and the DataFrame for feature2 does not have user 1. When I did this without Pipelines, I would do an outer join and then fillna(0) on the resulting merged dataframe, e.g.
merged_df = pd.merge(df1, df2, how='outer', left_on=['user_id'], right_on=['user_id'])
final_df = merged_df.fillna(0)
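Applied to aggregated versions of the example tables above, the outer join fills in the missing users on both sides (a sketch; the aggregated feature values are illustrative):

```python
import pandas as pd

# Per-user aggregates, as produced from the two example tables
f1 = pd.DataFrame({'user_id': [1, 2], 'x1': [2, 1]})        # user 3 missing
f2 = pd.DataFrame({'user_id': [2, 3], 'x2': [12.3, 14.5]})  # user 1 missing

merged_df = pd.merge(f1, f2, how='outer', on='user_id')
final_df = merged_df.fillna(0)
print(final_df)  # users 1, 2, 3 all present; absent features are 0
```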
But there does not seem to be any way to do this with FeatureUnion, and I can't think of a clean workaround in the Pipeline framework. Do I have to run separate pipelines, transform each of them, do the outer join and fillna in pandas, and then feed the completed feature matrix into a downstream modelling pipeline? Is there a better way? Looking to the community for help.
NOTE: I do NOT know the user_ids beforehand. I am querying the tables based on a timestamp range, not on user_id. The query itself tells me which users should be in the training (or test) set.
Upvotes: 0
Views: 562
Reputation: 7457
Why don't you build your own pandas-based union? Something like this (I haven't tested it; it's just to show the idea):
class DataMerging(BaseEstimator):
    def fit(self, x, y=None):
        return self

    def transform(self, dfs):
        df1, df2 = dfs
        merged_df = pd.merge(df1, df2, how='outer', on='user_id').fillna(0)
        return merged_df.values  # shape (n_samples, n_features)
pipeline = Pipeline([
    ('union', DataMerging()),
    ('other thing', ...)
])

pipeline.fit((df1, df2))
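To sanity-check the idea, here is a self-contained sketch of the merging transformer run on aggregated versions of the question's example tables (the feature values are illustrative):

```python
import pandas as pd
from sklearn.base import BaseEstimator

class DataMerging(BaseEstimator):
    def fit(self, x, y=None):
        return self

    def transform(self, dfs):
        # dfs is a (df1, df2) tuple; outer-join on user_id and zero-fill
        df1, df2 = dfs
        merged_df = pd.merge(df1, df2, how='outer', on='user_id').fillna(0)
        return merged_df.values  # shape (n_samples, n_features)

f1 = pd.DataFrame({'user_id': [1, 2], 'x1': [2, 1]})
f2 = pd.DataFrame({'user_id': [2, 3], 'x2': [12.3, 14.5]})

X = DataMerging().fit((f1, f2)).transform((f1, f2))
print(X.shape)  # one row per user (1, 2, 3); columns user_id, x1, x2
```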
Upvotes: 0