Reputation: 23592
I have some data like:
name   | cost
-------|-----
milk   | 1.20
butter | 3.50
eggs   | .99
I'd like to be able to group these by both name and price (e.g., "skim milk" is related to "whole milk", and five things that cost $2.99 would be scored as similar) using one of scikit-learn's clustering modules, such as KMeans.
I know I have to turn the names into numbers. I'm doing that with HashingVectorizer, which gives me a sparse matrix. However, at that point I'm not sure how to combine it with the cost column. The examples I've seen take the output of HashingVectorizer and feed it straight into kmeans.fit.
Can someone give (or point me towards) an example that uses text data + at least one other column?
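Here's a minimal sketch of what I have so far (the sample names and n_features are just for illustration):

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import KMeans

names = ["milk", "butter", "eggs"]

vec = HashingVectorizer(n_features=16)
X = vec.fit_transform(names)  # sparse matrix, shape (3, 16)

# this works on the names alone...
km = KMeans(n_clusters=2, n_init=10).fit(X)
# ...but where does the cost column fit in?
```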
Upvotes: 0
Views: 1400
Reputation: 476
You can use an ItemSelector (as in the example linked below) to apply a different processing method to each column of the data, and then use sklearn's FeatureUnion to put everything back together.
I think what you are saying is that you want to treat the prices as categorical data? If so, I would convert the numbers to strings and then use CountVectorizer with the binary flag set to True.
This is a really useful example: http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html#sphx-glr-auto-examples-hetero-feature-union-py
EDIT: Also, I think CountVectorizer's default tokenizer will strip out periods, so it's probably best to call it with an identity tokenizer like this. There is probably a more elegant solution in fewer lines, but this works too.
def no_tokenizer(t):
    return [t]

CountVectorizer(binary=True, tokenizer=no_tokenizer)
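As a quick sanity check of that idea (made-up price strings; the repeated "3.50" is there to show two items sharing a column):

```python
from sklearn.feature_extraction.text import CountVectorizer

def no_tokenizer(t):
    # identity tokenizer: keep the whole string, e.g. "3.50", as one token
    return [t]

prices = ["1.20", "3.50", ".99", "3.50"]
cv = CountVectorizer(binary=True, tokenizer=no_tokenizer)
X = cv.fit_transform(prices)  # one column per distinct price string
# rows 1 and 3 (both "3.50") get identical encodings
```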
EDIT: And here is an example of a pipeline with ItemSelector. Mine is set up for a pandas DataFrame, and I pass a keyword to say whether I want back text or numerical data.
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key, dt):
        self.key = key
        self.dt = dt

    def fit(self, x, y=None):
        # stateless: nothing to fit
        return self

    def transform(self, data_dict):
        # return the requested column, converted according to dt
        print("Selecting", self.key)
        if self.dt == 'text':
            return data_dict.loc[:, self.key]
        elif self.dt == 'num2text':
            return data_dict.loc[:, self.key].astype(str)
        elif self.dt == 'date':
            return (data_dict.loc[:, self.key].squeeze()
                    - pd.Timestamp(1900, 1, 1)).dt.days.to_frame()
        else:
            return data_dict.loc[:, [self.key]].astype(float)
def preproc_pipeline():
    return FeatureUnion(transformer_list=[
        ('artist_name', Pipeline([
            ('selector', ItemSelector(key='artist_name', dt='text')),
            ('cv', CountVectorizer(binary=False)),
        ])),
        ('composer', Pipeline([
            ('selector', ItemSelector(key='composer', dt='text')),
            ('cv', CountVectorizer(binary=False)),
        ])),
        ('song_length', Pipeline([
            ('selector', ItemSelector(key='song_length', dt='num')),
        ])),
    ])
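Applied to the milk/butter/eggs data from the question, the same FeatureUnion pattern might look like this (a self-contained sketch with a trimmed-down selector; the column names come from the question, everything else is illustrative):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline

class ColumnSelector(BaseEstimator, TransformerMixin):
    """Pick one DataFrame column: 1-D text, or a 2-D numeric array."""
    def __init__(self, key, numeric=False):
        self.key = key
        self.numeric = numeric

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if self.numeric:
            return X.loc[:, [self.key]].astype(float).values  # shape (n, 1)
        return X.loc[:, self.key]  # 1-D strings for CountVectorizer

df = pd.DataFrame({"name": ["milk", "butter", "eggs"],
                   "cost": [1.20, 3.50, 0.99]})

union = FeatureUnion([
    ("name", Pipeline([("sel", ColumnSelector("name")),
                       ("cv", CountVectorizer())])),
    ("cost", ColumnSelector("cost", numeric=True)),
])
X = union.fit_transform(df)  # shape (3, 4): 3 name columns + 1 cost column
```

Since CountVectorizer outputs a sparse matrix, FeatureUnion returns a sparse result here, which KMeans accepts directly.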
Upvotes: 2
Reputation: 5696
You can use one-hot encoding.
import pandas as pd

# you can easily load the data using pd.read_csv()
# or, if the data is in a numpy array, use pd.DataFrame(data) and pass
# the column names to the columns parameter as a list of strings
# For this example:
df = pd.DataFrame({'name': ['milk', 'butter', 'eggs'],
                   'cost': [10, 3.50, 0.99]})
print(df)
     name   cost
0    milk  10.00
1  butter   3.50
2    eggs   0.99
df = pd.get_dummies(data=df, columns=['name'])  # encode the 'name' column
print(df)
    cost  name_butter  name_eggs  name_milk
0  10.00            0          0          1
1   3.50            1          0          0
2   0.99            0          1          0
You can obtain the underlying numpy array with df.values:
array([[10.  ,  0.  ,  0.  ,  1.  ],
       [ 3.5 ,  1.  ,  0.  ,  0.  ],
       [ 0.99,  0.  ,  1.  ,  0.  ]])
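From there, the array can go straight into KMeans (a sketch; note that with mixed scales you may want to normalize the cost column first, e.g. with StandardScaler, so it doesn't dominate the one-hot columns):

```python
import pandas as pd
from sklearn.cluster import KMeans

df = pd.get_dummies(
    pd.DataFrame({'name': ['milk', 'butter', 'eggs'],
                  'cost': [10, 3.50, 0.99]}),
    columns=['name'])

km = KMeans(n_clusters=2, n_init=10).fit(df.values)
print(km.labels_)  # one cluster id per row
```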
Upvotes: 1