Reputation: 23592
I have some data like:
name   | cost
-------|-----
milk   | 1.20
butter | 3.50
eggs   | .99
I'd like to be able to group these by both name and price (e.g., "skim milk" is related to "whole milk", and five things that cost $2.99 would be scored as similar) using one of scikit-learn's clustering modules, such as KMeans.
I know I have to turn the names into numbers. I'm doing that with HashingVectorizer, which gives me a sparse matrix. However, at that point I'm not sure how to combine it with the cost column. The examples I've seen take the output of HashingVectorizer and feed it straight into kmeans.fit.
Can someone give (or point me towards) an example that uses text data + at least one other column?
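Here's a minimal sketch of what I have so far (the sample names and n_features are just for illustration):

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import KMeans

names = ["milk", "butter", "eggs"]

vec = HashingVectorizer(n_features=16)
X = vec.fit_transform(names)  # sparse matrix, shape (3, 16)

# this works on the names alone...
km = KMeans(n_clusters=2, n_init=10).fit(X)
# ...but where does the cost column fit in?
```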
Upvotes: 0
Views: 1400
Reputation: 476
You can use an ItemSelector (as in the example linked below) to apply a different processing method to each column of the data, and then use sklearn's FeatureUnion to put everything back together.
I think what you are saying is that you want to treat the prices as categorical data? If so, I would convert the numbers to strings and then use CountVectorizer with the binary flag set to True.
This is a really useful example: http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html#sphx-glr-auto-examples-hetero-feature-union-py
EDIT: Also, I think CountVectorizer's default tokenizer will strip out periods, so it's probably best to call it with an identity tokenizer like this. There is probably a more elegant solution in fewer lines, but this works too.
def no_tokenizer(t):
    return [t]

CountVectorizer(binary=True, tokenizer=no_tokenizer)
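As a quick sanity check of that idea (made-up price strings; the repeated "3.50" is there to show two items sharing a column):

```python
from sklearn.feature_extraction.text import CountVectorizer

def no_tokenizer(t):
    # identity tokenizer: keep the whole string, e.g. "3.50", as one token
    return [t]

prices = ["1.20", "3.50", ".99", "3.50"]
cv = CountVectorizer(binary=True, tokenizer=no_tokenizer)
X = cv.fit_transform(prices)  # one column per distinct price string
# rows 1 and 3 (both "3.50") get identical encodings
```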
EDIT: And here is an example of a pipeline with ItemSelector. Mine is set up for a pandas DataFrame, and I pass a keyword to say whether I want back text or numerical data.
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key, dt):
        self.key = key
        self.dt = dt

    def fit(self, x, y=None):
        # stateless: nothing to fit
        return self

    def transform(self, data_dict):
        # return the requested column, converted according to dt
        print("Selecting", self.key)
        if self.dt == 'text':
            return data_dict.loc[:, self.key]
        elif self.dt == 'num2text':
            return data_dict.loc[:, self.key].astype(str)
        elif self.dt == 'date':
            return (data_dict.loc[:, self.key].squeeze()
                    - pd.Timestamp(1900, 1, 1)).dt.days.to_frame()
        else:
            return data_dict.loc[:, [self.key]].astype(float)
def preproc_pipeline():
    return FeatureUnion(transformer_list=[
        ('artist_name', Pipeline([
            ('selector', ItemSelector(key='artist_name', dt='text')),
            ('cv', CountVectorizer(binary=False)),
        ])),
        ('composer', Pipeline([
            ('selector', ItemSelector(key='composer', dt='text')),
            ('cv', CountVectorizer(binary=False)),
        ])),
        ('song_length', Pipeline([
            ('selector', ItemSelector(key='song_length', dt='num')),
        ])),
    ])
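Applied to the milk/butter/eggs data from the question, the same FeatureUnion pattern might look like this (a self-contained sketch with a trimmed-down selector; the column names come from the question, everything else is illustrative):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline

class ColumnSelector(BaseEstimator, TransformerMixin):
    """Pick one DataFrame column: 1-D text, or a 2-D numeric array."""
    def __init__(self, key, numeric=False):
        self.key = key
        self.numeric = numeric

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if self.numeric:
            return X.loc[:, [self.key]].astype(float).values  # shape (n, 1)
        return X.loc[:, self.key]  # 1-D strings for CountVectorizer

df = pd.DataFrame({"name": ["milk", "butter", "eggs"],
                   "cost": [1.20, 3.50, 0.99]})

union = FeatureUnion([
    ("name", Pipeline([("sel", ColumnSelector("name")),
                       ("cv", CountVectorizer())])),
    ("cost", ColumnSelector("cost", numeric=True)),
])
X = union.fit_transform(df)  # shape (3, 4): 3 name columns + 1 cost column
```

Since CountVectorizer outputs a sparse matrix, FeatureUnion returns a sparse result here, which KMeans accepts directly.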
Upvotes: 2
Reputation: 5696
You can use one-hot encoding.
import pandas as pd

# you can easily load the data using pd.read_csv()
# or, if the data is in a numpy array, use pd.DataFrame(data) and pass
# the column names to the columns parameter as a list of strings
# For this example:
df = pd.DataFrame({'name': ['milk', 'butter', 'eggs'],
                   'cost': [10, 3.50, 0.99]})
print(df)
     name   cost
0    milk  10.00
1  butter   3.50
2    eggs   0.99
df = pd.get_dummies(data=df, columns=['name'])  # encode the 'name' column
print(df)
    cost  name_butter  name_eggs  name_milk
0  10.00            0          0          1
1   3.50            1          0          0
2   0.99            0          1          0
You can obtain the underlying numpy array with df.values:
array([[10.  ,  0.  ,  0.  ,  1.  ],
       [ 3.5 ,  1.  ,  0.  ,  0.  ],
       [ 0.99,  0.  ,  1.  ,  0.  ]])
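From there, the array can go straight into KMeans (a sketch; note that with mixed scales you may want to normalize the cost column first, e.g. with StandardScaler, so it doesn't dominate the one-hot columns):

```python
import pandas as pd
from sklearn.cluster import KMeans

df = pd.get_dummies(
    pd.DataFrame({'name': ['milk', 'butter', 'eggs'],
                  'cost': [10, 3.50, 0.99]}),
    columns=['name'])

km = KMeans(n_clusters=2, n_init=10).fit(df.values)
print(km.labels_)  # one cluster id per row
```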
Upvotes: 1