Brad Dixon

Reputation: 198

How should I transform multiple key/value columns in a scikit-learn pipeline?

I'd like to build a sklearn pipeline to transform data that contains multiple key/value pairs:

import pandas as pd
D = pd.DataFrame([ ['a', 1, 'b', 2], ['b', 2, 'c', 3]], columns = ['k1', 'v1', 'k2', 'v2'])
print(D)

Output:

  k1  v1 k2  v2
0  a   1  b   2
1  b   2  c   3

DictVectorizer seems appropriate but I'm struggling with transforming multiple key/value columns present on each row into a suitable dict for processing.

DictVectorizer seems amenable to input like this:

row1 = {'a':1, 'b':2}
row2 = {'b':2, 'c':3}
data = [row1, row2]
# This is the output structure that I need:
print(data)

yielding:

[{'a': 1, 'b': 2}, {'c': 3, 'b': 2}]

Then it will transform into an array like this:

from sklearn.feature_extraction import DictVectorizer
DictVectorizer(sparse=False).fit_transform(data)

Final output:

array([[ 1.,  2.,  0.],
       [ 0.,  2.,  3.]])

What would be a suitable custom transformer to transform multiple key/value pairs as shown above?

Upvotes: 3

Views: 2386

Answers (2)

rabbit

Reputation: 1476

Building on Mike's answer (which is definitely more elegant than my original one), you can use the same logic of pairs of columns and avoid having to specify each pair with the following:

[dict((row.iloc[i-1], row.iloc[i]) for i in range(1, len(D.columns), 2)) for index, row in D.iterrows()]

This yields the following:

[{'a': 1, 'b': 2}, {'c': 3, 'b': 2}]

Note: This assumes that the pairs are organized as in your example (k1, v1, k2, v2, etc.) and that there is an even number of columns.
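
If it helps, the assumption in the note can be checked explicitly instead of left implicit. This is only a sketch: `rows_to_dicts` is a made-up helper name, not something pandas or scikit-learn provides, and it pairs columns purely by position.

```python
import pandas as pd

def rows_to_dicts(df):
    """Hypothetical helper: turn alternating key/value columns
    (k1, v1, k2, v2, ...) into one dict per row."""
    if len(df.columns) % 2 != 0:
        raise ValueError("expected an even number of columns (key/value pairs)")
    # Even positions hold the keys, odd positions hold the values.
    keys = df.iloc[:, 0::2].values.tolist()
    values = df.iloc[:, 1::2].values.tolist()
    # If a key repeats within a row, the last value wins.
    return [dict(zip(k, v)) for k, v in zip(keys, values)]

D = pd.DataFrame([['a', 1, 'b', 2], ['b', 2, 'c', 3]],
                 columns=['k1', 'v1', 'k2', 'v2'])
data = rows_to_dicts(D)
print(data)
```

The result should be the same list of dicts as above, ready to hand to DictVectorizer.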

Upvotes: 0

Mike

Reputation: 7203

I don't know about a special transformer but you could use a simple list comprehension:

>>> data = [{row['k1']:row['v1'], row['k2']:row['v2']} for index, row in D.iterrows()]
>>> data
[{'a': 1, 'b': 2}, {'c': 3, 'b': 2}]

From here you could use a dict vectorizer like this:

>>> import sklearn.feature_extraction
>>> v = sklearn.feature_extraction.DictVectorizer(sparse=False)
>>> X = v.fit_transform(data)
>>> print(X)
[[ 1.  2.  0.]
 [ 0.  2.  3.]]
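
Since the question asks about a pipeline, the same conversion could probably be wrapped in a small transformer class and chained with DictVectorizer. A minimal sketch, assuming the columns alternate key, value, key, value; `KeyValueToDict` is a name invented here for illustration, not an existing scikit-learn class.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline

class KeyValueToDict(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: one dict per row from alternating
    key/value columns (assumed layout: k1, v1, k2, v2, ...)."""

    def fit(self, X, y=None):
        # Nothing to learn; the conversion is stateless.
        return self

    def transform(self, X):
        keys = X.iloc[:, 0::2].values.tolist()
        values = X.iloc[:, 1::2].values.tolist()
        return [dict(zip(k, v)) for k, v in zip(keys, values)]

pipe = Pipeline([
    ("to_dicts", KeyValueToDict()),
    ("vectorize", DictVectorizer(sparse=False)),
])

D = pd.DataFrame([['a', 1, 'b', 2], ['b', 2, 'c', 3]],
                 columns=['k1', 'v1', 'k2', 'v2'])
X = pipe.fit_transform(D)
print(X)
```

The vectorizer step keeps the learned column order in `pipe.named_steps["vectorize"].feature_names_`, which is useful when mapping the array columns back to keys.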

Upvotes: 4
