Dhruv Ghulati

Reputation: 3026

Using label encoder on a dictionary

I am using the sklearn LabelEncoder. I know how to use it for a 1D array, but my use case is as such:

I have multiple arrays of dicts like this (each dict is effectively the cost of assigning each text label u'a', u'b', etc. in a classifier), all within a dict:

{'open_model':    
[
    {u'a': 47502.125, u'c': 45.3, u'd': 2.3, u'e': 0.45},
    {u'b': 121, u'a': 1580.5625, u'c': 12, u'e': 62,u'd':0.343},
    {u'e': 12321, u'b': 4, u'a': 0.1112}
    ],
 'closed_model':
 [
    {u'a': 1231.22, u'c': 43.1},
    {u'b': 342.2, u'a': 121.1, u'c': 343},
    {u'b': 14.2, u'a': 53.2}
    ]
}

I need to be able to encode this into numerical labels and then decode all of them back, so for example:

[
    {1: 47502.125, 3: 45.3, 4: 2.3, 5: 0.45},
    {2: 121, 1: 1580.5625, 3: 12, 5: 62, 4: 0.343},
    {5: 12321, 2: 4, 1: 0.1112}
    ]

Which I use effectively to generate predictions of the best label for each row, so:

[5, 4, 1] perhaps in this case.

What I need to do is to be able to decode this back into:

[u'e',u'd', u'a'] perhaps in this case.

How can I get the same LabelEncoder functionality but to fit_transform on an array of dicts where the dict keys are my labels?

Note, each dict within the array of dicts has a different length, but I do have the list of all potential labels, i.e. for the open_model labels: set([u'a',u'b',u'c',u'd',u'e']), and for the closed_model labels: set([u'a',u'b',u'c']).
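In other words, I'm after something like this sketch, fitting the encoder on my known label set (variable names here are just illustrative):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit([u'a', u'b', u'c', u'd', u'e'])  # full known label set for open_model

rows = [
    {u'a': 47502.125, u'c': 45.3, u'd': 2.3, u'e': 0.45},
    {u'e': 12321, u'b': 4, u'a': 0.1112},
]
# encode: replace each string key by its integer code
encoded = [{int(le.transform([k])[0]): v for k, v in row.items()} for row in rows]
# decode a list of predicted integer labels back to strings
decoded = list(le.inverse_transform([4, 3, 0]))  # [u'e', u'd', u'a']
```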

Upvotes: 1

Views: 5879

Answers (2)

geompalik

Reputation: 1582

Although it is good practice to use already-implemented functionality, you can easily achieve this with a couple of lines of code. Given your list input:

dico = [
{u'a': 47502.125, u'b': 1580.5625, u'c': 45.3, u'd': 2.3, u'e': 0.45},
{u'b': 121, u'a': 1580.5625, u'c': 12, u'e': 62, u'd': 0.343},
{u'e': 12321, u'b': 4, u'd': 5434, u'c': 2.3, u'a': 0.1112}
]

you can get the set of labels simply by:

keyset = set(dico[0].keys())  # set of keys, assuming they all appear in the first list item
mapping = {val: key + 1 for key, val in enumerate(sorted(keyset))}      # str -> int (sorted for determinism)
inv_mapping = {key + 1: val for key, val in enumerate(sorted(keyset))}  # int -> str

Having the mapping and inv_mapping you can change the representation of your data by:

for inner_dict in dico:
    for key in list(inner_dict.keys()):  # copy the keys, since we mutate the dict
        inner_dict[mapping[key]] = inner_dict.pop(key)
print dico

which will give you [{1: 47502.125, ...}] and then if needed:

for inner_dict in dico:
    for key in list(inner_dict.keys()):  # copy the keys, since we mutate the dict
        inner_dict[inv_mapping[key]] = inner_dict.pop(key)
print dico

to get the initial version.

Also, and maybe more closely related to your issue, given your output [5, 4, 1] you can easily transform it with:

output = [5, 4, 1]
print [inv_mapping[i] for i in output]
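Since the question actually has two models with different label sets, the same idea can be applied per model. A sketch under that assumption (container names here are hypothetical):

```python
data = {
    'open_model': [
        {u'a': 47502.125, u'c': 45.3, u'd': 2.3, u'e': 0.45},
        {u'e': 12321, u'b': 4, u'a': 0.1112},
    ],
    'closed_model': [
        {u'a': 1231.22, u'c': 43.1},
        {u'b': 14.2, u'a': 53.2},
    ],
}

mappings, encoded = {}, {}
for model, rows in data.items():
    labels = sorted({k for row in rows for k in row})           # per-model label set
    mapping = {label: i + 1 for i, label in enumerate(labels)}  # str -> int
    mappings[model] = {i: label for label, i in mapping.items()}  # int -> str, for decoding
    encoded[model] = [{mapping[k]: v for k, v in row.items()} for row in rows]
```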

Upvotes: 2

Alessandro Mariani

Reputation: 1221

It seems that you always have 'a', 'b', 'c', 'd', 'e'. If that is the case, why not use a pandas DataFrame and forget about the encoder? You need to rewrite the keys of the dictionaries either way, so it's going to be messy anyway!

import pandas as pd
i = [
{u'a': 47502.125, u'b': 1580.5625, u'c': 45.3, u'd': 2.3, u'e': 0.45},
{u'b': 121, u'a': 1580.5625, u'c': 12, u'e': 62, u'd': 0.343},
{u'e': 12321, u'b': 4, u'd': 5434, u'c': 2.3, u'a': 0.1112}
]
# transform to data frame
df = pd.DataFrame(i)
print df
            a          b     c         d         e
0  47502.1250  1580.5625  45.3     2.300      0.45
1   1580.5625   121.0000  12.0     0.343     62.00
2      0.1112     4.0000   2.3  5434.000  12321.00

# create a mapping from integer position to column label
mapping = dict(enumerate(df.columns))

# rename columns
df.columns = range(len(df.columns))

# print your new input data
print df.to_dict(orient='records')
[{0: 47502.125, 1: 1580.5625, 2: 45.3, 3: 2.3, 4: 0.45},
 {0: 1580.5625, 1: 121.0, 2: 12.0, 3: 0.343, 4: 62.0},
 {0: 0.1112, 1: 4.0, 2: 2.3, 3: 5434.0, 4: 12321.0}]

# translate prediction
prediction = [3, 4, 1]
print [mapping[k] for k in prediction]
[u'd', u'e', u'b']

It's not straightforward, but I guess it will take less time than using the encoder :)
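If, as in the question, the prediction is just the best-scoring label per row, pandas can even pick it directly, skipping the encoding step entirely. A sketch, assuming "best" means the largest value per row (NaN cells from missing keys are ignored by idxmax):

```python
import pandas as pd

rows = [
    {u'a': 47502.125, u'c': 45.3, u'd': 2.3, u'e': 0.45},
    {u'b': 121, u'a': 1580.5625, u'c': 12, u'e': 62, u'd': 0.343},
    {u'e': 12321, u'b': 4, u'a': 0.1112},
]
df = pd.DataFrame(rows)            # missing keys become NaN
best = df.idxmax(axis=1).tolist()  # column label of the row-wise maximum
```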

Upvotes: 1
