kris

Reputation: 23592

How to combine numeric and text data in a scikit dataset

I have some data like:

name    | cost
-----------------
milk    | 1.20
butter  | 3.50
eggs    |  .99

I'd like to group these by both name and price (e.g., "skim milk" is related to "whole milk", and five things that cost $2.99 would be scored as similar) using one of scikit-learn's clustering modules, such as KMeans.

I know I have to turn the names into numbers, which I'm doing using the HashingVectorizer, which gives me a sparse matrix. However, at that point, I'm not sure how to combine that with the costs column. The examples I've seen take the output of HashingVectorizer and feed it straight into kmeans.fit.

Can someone give (or point me towards) an example that uses text data + at least one other column?

Upvotes: 0

Views: 1400

Answers (2)

Tasko Olevski

Reputation: 476

You can use an 'ItemSelector' as in the example in the link below to have different methods of processing for each column in the data and then use sklearn's FeatureUnion to put everything back together.

I think what you are saying is you want to process the prices like they are categorical data? I would say convert the numbers to strings and then use CountVectorizer with the binary flag set to True.

This is a really useful example: http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html#sphx-glr-auto-examples-hetero-feature-union-py
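
For a single extra column, the same idea can also be sketched by hand: vectorize the text, then stack the numeric column next to the sparse matrix. This is only a minimal sketch, assuming scipy.sparse.hstack is acceptable for your case; the rows, n_features=16 and n_clusters=2 are made-up values for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import HashingVectorizer

# Made-up rows for illustration only
names = ["skim milk", "whole milk", "butter", "eggs"]
costs = np.array([[1.20], [1.30], [3.50], [0.99]])

# Text -> sparse matrix; n_features is kept small only for readability
text_features = HashingVectorizer(n_features=16).transform(names)

# Append the cost column to the right of the sparse text features
combined = hstack([text_features, csr_matrix(costs)]).tocsr()

# KMeans accepts sparse input directly
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(combined)
print(combined.shape, labels)
```

Note that the cost column is not scaled here, so it may outweigh the normalized text features; rescaling it first is probably worth trying.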

EDIT: Also, I think CountVectorizer's default tokenizer will strip out periods, so it is probably best to pass a tokenizer that returns the whole string as a single token, like this. There is probably a more elegant solution in fewer lines, but this works too.

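from sklearn.feature_extraction.text import CountVectorizer

def no_tokenizer(t):
    # return the whole string as one token so "2.99" is not split at the period
    return [t]

CountVectorizer(binary=True, tokenizer=no_tokenizer)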
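
As a rough usage sketch of the above, here is how prices converted to strings could be encoded so that equal prices get identical rows. The prices are made up, and lowercase=False is my own addition (it does not matter for numeric strings).

```python
from sklearn.feature_extraction.text import CountVectorizer

def no_tokenizer(t):
    # Treat the whole string as one token
    return [t]

prices = [2.99, 2.99, 0.99]                  # made-up prices
price_strings = [str(p) for p in prices]     # numbers -> strings

cv = CountVectorizer(binary=True, tokenizer=no_tokenizer, lowercase=False)
X = cv.fit_transform(price_strings)

print(cv.vocabulary_)   # one vocabulary entry per distinct price
print(X.toarray())      # the two 2.99 rows are identical
```

scikit-learn may print a warning that token_pattern is unused when a tokenizer is given; that is expected here.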

EDIT: And here is an example of a pipeline with ItemSelector. Mine is set up for a pandas DataFrame, and the dt keyword controls whether the selected column comes back as text or as numerical data.

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key, dt):
        self.key = key
        self.dt = dt

    def fit(self, x, y=None):
        # does nothing
        return self

    def transform(self, data_dict):
        # this returns the column requested
        print("Selecting", self.key)
        if self.dt == 'text':
            return data_dict.loc[:, self.key]
        elif self.dt == 'num2text':
            # Python 3: str replaces the Python 2 unicode type
            return data_dict.loc[:, self.key].astype(str)
        elif self.dt == 'date':
            # days since 1900-01-01, as a one-column frame
            return (data_dict.loc[:, self.key].squeeze() - pd.Timestamp(1900, 1, 1)).dt.days.to_frame()
        else:
            return data_dict.loc[:, [self.key]].astype(float)

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline

def preproc_pipeline():
    return FeatureUnion(
        transformer_list=[
            ('artist_name', Pipeline([
                ('selector', ItemSelector(key='artist_name', dt='text')),
                ('cv', CountVectorizer(binary=False)),
            ])),
            ('composer', Pipeline([
                ('selector', ItemSelector(key='composer', dt='text')),
                ('cv', CountVectorizer(binary=False)),
            ])),
            ('song_length', Pipeline([
                ('selector', ItemSelector(key='song_length', dt='num')),
            ])),
        ])
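
In newer scikit-learn versions (0.20 and later), ColumnTransformer can probably do the same job without a custom selector class. This is a sketch under that assumption, not what I originally used; the DataFrame contents are made up, and only the column names come from the pipeline above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Made-up rows; only the column names match the pipeline above
df = pd.DataFrame({
    "artist_name": ["the beatles", "the rolling stones", "queen"],
    "composer": ["lennon mccartney", "jagger richards", "freddie mercury"],
    "song_length": [183.0, 224.0, 355.0],
})

preproc = ColumnTransformer([
    # A string column name passes a 1-D Series, which CountVectorizer expects
    ("artist_name", CountVectorizer(), "artist_name"),
    ("composer", CountVectorizer(), "composer"),
    # A list of names passes a 2-D frame; "passthrough" keeps the numbers as-is
    ("song_length", "passthrough", ["song_length"]),
])

features = preproc.fit_transform(df)
print(features.shape)
```

The result has one column per vocabulary word from each text column plus the numeric column, and it can be passed to a clusterer the same way as the FeatureUnion output.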

Upvotes: 2

akilat90

Reputation: 5696

Pandas

You can use one-hot encoding.

import pandas as pd

# you can easily load the data using pd.read_csv()
# Or, if the data is in a numpy array, just use pd.DataFrame(data) and pass the column names to the columns parameter as a list of strings

# For this example,
df = pd.DataFrame({'name':['milk', 'butter', 'eggs'],
              'cost':[10, 3.50, 0.99]}) 

print(df)

     name   cost
0    milk  10.00
1  butter   3.50
2    eggs   0.99

df=pd.get_dummies(data=df, columns=['name']) # indicates that we want to encode the name column
print(df)

    cost  name_butter  name_eggs  name_milk
0  10.00            0          0          1
1   3.50            1          0          0
2   0.99            0          1          0

You can then get the data set as a NumPy array with df.values:

array([[ 10.  ,   0.  ,   0.  ,   1.  ],
       [  3.5 ,   1.  ,   0.  ,   0.  ],
       [  0.99,   0.  ,   1.  ,   0.  ]])
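
From there, the array can be fed to a clusterer. Below is a minimal sketch, assuming KMeans with two clusters (an arbitrary choice for three rows) and a StandardScaler step that I am adding so the cost column does not dominate the 0/1 columns. dtype=float is passed because newer pandas versions may return boolean dummy columns.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'name': ['milk', 'butter', 'eggs'],
                   'cost': [10, 3.50, 0.99]})

# One-hot encode the name column; force float so df.values is a numeric array
df = pd.get_dummies(data=df, columns=['name'], dtype=float)

# Put cost and the dummy columns on a comparable scale (my assumption)
X = StandardScaler().fit_transform(df.values)

# n_clusters=2 is arbitrary for this tiny example
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(labels)
```

Keep in mind that one-hot encoding treats every distinct name as unrelated, so "skim milk" and "whole milk" would not end up any closer than "milk" and "eggs".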

Upvotes: 1
