Sourav Dey

Reputation: 427

Sci-Kit Learn FeatureUnion with different number of rows

I am trying to use the FeatureUnion functionality of scikit-learn Pipelines on a project where the data is in a database. I am having some foundational issues in how to structure what I'm doing.

I am creating two features from two different tables in the database. I have fetch_x1 and fetch_x2 methods that grab the data of interest from the database tables as pandas DataFrames. I pack the two DataFrames into a dictionary of DataFrames. In each transformer, I unpack the DataFrame of interest and operate on it. I'm loosely following the pattern of this post.

My code is below:

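from sklearn.base import TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline

class Feature1Extractor(TransformerMixin):

    def transform(self, dictionary_of_dataframes):
        df = dictionary_of_dataframes['feature1_raw_data']
        # Number of x1 events per user
        x = df.groupby('user_id').count()['x1']
        return x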

class Feature2Extractor(TransformerMixin):

    def transform(self, dictionary_of_dataframes):
        df = dictionary_of_dataframes['feature2']
        x = df.groupby('user_id').sum()['x2']
        return x

pipeline = Pipeline([
    ('union', FeatureUnion(
        transformer_list=[
            ('feature1', Feature1Extractor()),
            ('feature2', Feature2Extractor())])),
    ('null', None)
])

pipeline.transform(dictionary_of_dataframes)

I'm running into a more foundational issue: after transformation, the feature matrices that come out of the two transformers have different numbers of rows. Consequently, the simple hstack at the end of FeatureUnion fails like so:

ValueError: all the input array dimensions except for the concatenation axis must match exactly
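
The same kind of failure can be reproduced outside the pipeline. The sketch below is only an illustration: the array shapes are made up, and it assumes FeatureUnion ends up calling NumPy's hstack on dense outputs, as described above.

```python
import numpy as np

# Hypothetical outputs of the two transformers: one column each,
# but a different number of rows (users).
a = np.zeros((2, 1))
b = np.zeros((3, 1))

try:
    np.hstack([a, b])
    error = None
except ValueError as exc:
    # hstack needs every input to have the same number of rows
    error = str(exc)

print(error)
```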

The row mismatch is inherent in my data. A number of user_ids are not present in the feature1 table, and similarly a number of user_ids are not present in the feature2 table. If a user has no data in the feature1 table, he/she never used that feature in the app, i.e. no data = no engagement with that feature. To make this explicit, here's an example of the two df's that are being passed to each transformer:

df (for feature1)

user_id, x1, timestamp
1, 'click', 1/1/2016
1, 'click', 1/2/2016
2, 'click', 1/2/2016

df (for feature2)

user_id, x2, timestamp
2, 12.3, 1/2/2016
3, 14.5, 1/4/2016

Note how the DataFrame for feature1 does not have user 3, and the DataFrame for feature2 does not have user 1. When I did this without Pipelines, I would do an outer join and then fillna(0) on the resulting merged dataframe, e.g.

merged_df = pd.merge(df1, df2, how='outer', left_on=['user_id'], right_on=['user_id'])
final_df = merged_df.fillna(0)
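
For concreteness, here is a minimal sketch of that non-Pipeline approach on the toy tables above. It is an illustration under assumptions: the DataFrame literals are typed in by hand (reading the second x2 value as 14.5), each table is first aggregated to one row per user as in the transformers, and the names f1 and f2 are made up for this example.

```python
import pandas as pd

# Toy data mirroring the two example tables (values are illustrative).
df1 = pd.DataFrame({
    "user_id": [1, 1, 2],
    "x1": ["click", "click", "click"],
    "timestamp": ["1/1/2016", "1/2/2016", "1/2/2016"],
})
df2 = pd.DataFrame({
    "user_id": [2, 3],
    "x2": [12.3, 14.5],
    "timestamp": ["1/2/2016", "1/4/2016"],
})

# Aggregate each table to one row per user.
f1 = df1.groupby("user_id")["x1"].count().reset_index()
f2 = df2.groupby("user_id")["x2"].sum().reset_index()

# The outer join keeps users that appear in either table;
# a user missing from one table gets 0 for that feature.
final_df = pd.merge(f1, f2, how="outer", on="user_id").fillna(0)
print(final_df)
```

User 1 ends up with x2 = 0 and user 3 with x1 = 0, which is the "no data = no engagement" behaviour I want.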

But there does not seem to be any way to do this with FeatureUnion, and I can't think of a clean workaround within the Pipeline framework. Do I have to run separate pipelines, transform each of them, do the outer join and fillna in pandas, and then feed the completed feature matrix into a downstream modelling pipeline? Is there a better way? Looking to the community for help.

NOTE: I do NOT know the user_ids beforehand. I am querying the tables based on the timestamp range, not user_id. The query itself tells me which users I should have in the training (or test) set.

Upvotes: 0

Views: 562

Answers (1)

Dror Hilman

Reputation: 7457

Why not build your own pandas-based union? Something like this (I haven't tested it; it's just to show the idea):

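import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class DataMerging(BaseEstimator, TransformerMixin):

    def fit(self, x, y=None):
        # Nothing to learn; the merge is stateless.
        return self

    def transform(self, dfs):
        # dfs is a pair of DataFrames that share a 'user_id' column.
        df1, df2 = dfs
        merged_df = pd.merge(df1, df2, how='outer', on='user_id').fillna(0)
        return merged_df.values  # shape (n_samples, n_features)


pipeline = Pipeline([
    ('union', DataMerging()),
    ('other thing', ...)
])

# Pass both DataFrames as a single X; a second positional argument would be treated as y.
pipeline.fit((df1, df2))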
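
If you would rather keep FeatureUnion, one possible variation is to have every transformer emit one row per user in the union of user_ids found across all the input tables, filling absent users with 0, so the outputs always have matching rows. The following is an untested-against-real-data sketch under that assumption: all_user_ids and AlignedAggregator are hypothetical names (not scikit-learn APIs), and the dictionary keys and toy values are made up for illustration.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion


def all_user_ids(frames):
    """Sorted union of user_ids across every DataFrame in the dict."""
    ids = set()
    for df in frames.values():
        ids.update(df["user_id"].unique())
    return sorted(ids)


class AlignedAggregator(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: aggregate one table, then align to all users."""

    def __init__(self, key, column, agg):
        self.key = key        # which DataFrame in the dict to use
        self.column = column  # which column to aggregate
        self.agg = agg        # aggregation name, e.g. "count" or "sum"

    def fit(self, frames, y=None):
        return self

    def transform(self, frames):
        per_user = frames[self.key].groupby("user_id")[self.column].agg(self.agg)
        # Users absent from this table get 0 instead of being dropped.
        aligned = per_user.reindex(all_user_ids(frames), fill_value=0)
        return aligned.to_numpy(dtype=float).reshape(-1, 1)


# Toy input mirroring the tables in the question (values are illustrative).
frames = {
    "feature1": pd.DataFrame({"user_id": [1, 1, 2], "x1": ["click"] * 3}),
    "feature2": pd.DataFrame({"user_id": [2, 3], "x2": [12.3, 14.5]}),
}

union = FeatureUnion([
    ("feature1", AlignedAggregator("feature1", "x1", "count")),
    ("feature2", AlignedAggregator("feature2", "x2", "sum")),
])
X = union.fit_transform(frames)
print(X)
```

Because the user list is derived from the dictionary that is passed in, the user_ids do not need to be known beforehand; the rows of X follow the sorted order of the ids found in that query.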

Upvotes: 0
