Reputation: 198
I'd like to build a sklearn pipeline to transform data that contains multiple key/value pairs:
import pandas as pd
D = pd.DataFrame([ ['a', 1, 'b', 2], ['b', 2, 'c', 3]], columns = ['k1', 'v1', 'k2', 'v2'])
print(D)
Output:
  k1  v1 k2  v2
0  a   1  b   2
1  b   2  c   3
DictVectorizer seems appropriate, but I'm struggling to turn the multiple key/value columns on each row into a suitable dict for processing.
DictVectorizer seems amenable to input like this:
row1 = {'a':1, 'b':2}
row2 = {'b':2, 'c':3}
data = [row1, row2]
# This is the output structure that I need:
print(data)
yielding:
[{'a': 1, 'b': 2}, {'c': 3, 'b': 2}]
DictVectorizer will then transform it into an array like this:
DictVectorizer(sparse=False).fit_transform(data)
Final output:
array([[ 1.,  2.,  0.],
       [ 0.,  2.,  3.]])
What would be a suitable custom transformer to transform multiple key/value pairs as shown above?
Upvotes: 3
Views: 2386
Reputation: 1476
Building on Mike's answer (which is definitely more elegant than my original one), you can apply the same pairs-of-columns logic without having to spell out each pair explicitly:
[dict((row.iloc[i-1], row.iloc[i]) for i in range(1, len(D.columns), 2)) for index, row in D.iterrows()]
This yields the following:
[{'a': 1, 'b': 2}, {'c': 3, 'b': 2}]
Note: This assumes the pairs are organized as in your example (k1, v1, k2, v2, etc.) and that there is an even number of columns.
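For completeness, the pair-walking logic can be wrapped in a small helper so it works for any number of key/value pairs. This is just a sketch; the name pairs_to_dicts and the three-pair example frame are mine, not from the answer:

```python
import pandas as pd

def pairs_to_dicts(df):
    # Walk the columns two at a time: column i-1 holds the key, column i the value.
    return [dict((row.iloc[i - 1], row.iloc[i]) for i in range(1, len(df.columns), 2))
            for _, row in df.iterrows()]

# Works for any number of key/value column pairs, e.g. three pairs here.
D3 = pd.DataFrame([['a', 1, 'b', 2, 'c', 3]],
                  columns=['k1', 'v1', 'k2', 'v2', 'k3', 'v3'])
print(pairs_to_dicts(D3))  # [{'a': 1, 'b': 2, 'c': 3}]
```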
Upvotes: 0
Reputation: 7203
I don't know about a special transformer but you could use a simple list comprehension:
>>> data = [{row['k1']:row['v1'], row['k2']:row['v2']} for index, row in D.iterrows()]
>>> data
[{'a': 1, 'b': 2}, {'c': 3, 'b': 2}]
From here you could use a dict vectorizer like this:
>>> from sklearn.feature_extraction import DictVectorizer
>>> v = DictVectorizer(sparse=False)
>>> X = v.fit_transform(data)
>>> print(X)
[[1. 2. 0.]
 [0. 2. 3.]]
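If you want this as the "custom transformer" the question asks for, one option is to wrap the comprehension in a FunctionTransformer and chain it with DictVectorizer in a Pipeline. A minimal sketch, assuming the fixed k1/v1/k2/v2 layout from the question (the helper name rows_to_dicts is mine):

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def rows_to_dicts(df):
    # Pair each row's key columns with its value columns.
    return [{row['k1']: row['v1'], row['k2']: row['v2']}
            for _, row in df.iterrows()]

pipeline = Pipeline([
    ('to_dicts', FunctionTransformer(rows_to_dicts, validate=False)),
    ('vectorize', DictVectorizer(sparse=False)),
])

D = pd.DataFrame([['a', 1, 'b', 2], ['b', 2, 'c', 3]],
                 columns=['k1', 'v1', 'k2', 'v2'])
X = pipeline.fit_transform(D)
# X is the 2x3 array from the question, with one column per distinct key (a, b, c).
```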
Upvotes: 4