Reputation: 198
I'd like to build a sklearn pipeline to transform data that contains multiple key/value pairs:
import pandas as pd
D = pd.DataFrame([ ['a', 1, 'b', 2], ['b', 2, 'c', 3]], columns = ['k1', 'v1', 'k2', 'v2'])
print(D)
Output:
  k1  v1 k2  v2
0  a   1  b   2
1  b   2  c   3
DictVectorizer seems appropriate, but I'm struggling to turn the multiple key/value columns on each row into a suitable dict for processing.
DictVectorizer seems amenable to input like this:
row1 = {'a':1, 'b':2}
row2 = {'b':2, 'c':3}
data = [row1, row2]
# This is the output structure that I need:
print(data)
yielding:
[{'a': 1, 'b': 2}, {'c': 3, 'b': 2}]
DictVectorizer will then transform it into an array like this:
DictVectorizer(sparse=False).fit_transform(data)
Final output:
array([[ 1.,  2.,  0.],
       [ 0.,  2.,  3.]])
What would be a suitable custom transformer to transform multiple key/value pairs as shown above?
Upvotes: 3
Views: 2386
Reputation: 1476
Building on Mike's answer (which is definitely more elegant than my original one), you can apply the same pairs-of-columns logic without having to spell out each pair explicitly:
[dict((row.iloc[i-1], row.iloc[i]) for i in range(1, len(D.columns), 2)) for index, row in D.iterrows()]
This yields the following:
[{'a': 1, 'b': 2}, {'c': 3, 'b': 2}]
Note: This assumes the pairs are organized as in your example (k1, v1, k2, v2, etc.) and that there is an even number of columns.
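For completeness, the pair-walking logic can be wrapped in a small helper so it works for any number of key/value pairs. This is just a sketch; the name pairs_to_dicts and the three-pair example frame are mine, not from the answer:

```python
import pandas as pd

def pairs_to_dicts(df):
    # Walk the columns two at a time: column i-1 holds the key, column i the value.
    return [dict((row.iloc[i - 1], row.iloc[i]) for i in range(1, len(df.columns), 2))
            for _, row in df.iterrows()]

# Works for any number of key/value column pairs, e.g. three pairs here.
D3 = pd.DataFrame([['a', 1, 'b', 2, 'c', 3]],
                  columns=['k1', 'v1', 'k2', 'v2', 'k3', 'v3'])
print(pairs_to_dicts(D3))  # [{'a': 1, 'b': 2, 'c': 3}]
```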
Upvotes: 0
Reputation: 7203
I don't know about a special transformer but you could use a simple list comprehension:
>>> data = [{row['k1']:row['v1'], row['k2']:row['v2']} for index, row in D.iterrows()]
>>> data
[{'a': 1, 'b': 2}, {'c': 3, 'b': 2}]
From here you could use a dict vectorizer like this:
>>> from sklearn.feature_extraction import DictVectorizer
>>> v = DictVectorizer(sparse=False)
>>> X = v.fit_transform(data)
>>> print(X)
[[1. 2. 0.]
 [0. 2. 3.]]
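If you want this as the "custom transformer" the question asks for, one option is to wrap the comprehension in a FunctionTransformer and chain it with DictVectorizer in a Pipeline. A minimal sketch, assuming the fixed k1/v1/k2/v2 layout from the question (the helper name rows_to_dicts is mine):

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def rows_to_dicts(df):
    # Pair each row's key columns with its value columns.
    return [{row['k1']: row['v1'], row['k2']: row['v2']}
            for _, row in df.iterrows()]

pipeline = Pipeline([
    ('to_dicts', FunctionTransformer(rows_to_dicts, validate=False)),
    ('vectorize', DictVectorizer(sparse=False)),
])

D = pd.DataFrame([['a', 1, 'b', 2], ['b', 2, 'c', 3]],
                 columns=['k1', 'v1', 'k2', 'v2'])
X = pipeline.fit_transform(D)
# X is the 2x3 array from the question, with one column per distinct key (a, b, c).
```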
Upvotes: 4