Thijs van Ede

Reputation: 917

How to one hot encode unordered discrete data in python?

Problem

There seems to be no easy way to one-hot-encode data that has no order. My question is: what is the best way to one-hot-encode values that have no particular order? And if there is no standardised way to do this, why should one-hot-encoded features be ordered?

Example

I am trying to one-hot-encode a set of features where the values are custom objects. My object looks like this:

class MyObject(object):
    def __init__(self, identity):
        self.identity = identity

    def __hash__(self):
        return self.identity

    def __eq__(self, other):
        return self.identity == other.identity

In this setting each instance of MyObject can be compared on equality. Suppose we have the following list of objects:

objects = [MyObject(0), MyObject(1), MyObject(0)]

Calling set(objects) yields a set of 2 objects, namely MyObject(0) and MyObject(1), which is the behaviour I expect. Therefore, when I one-hot-encode this data, I would expect something of the form:

index   MyObject_0  MyObject_1
    0            1           0
    1            0           1
    2            1           0

However, every solution I tried requires the values being one-hot-encoded to have some sort of order, which is undefined in my case. I think a one-hot encoding should still be possible when the order is undefined, because in that case it does not matter which one-hot-encoded feature comes before another.
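
To illustrate the point, here is a minimal sketch of an encoding that relies only on hashing and equality. The helper name one_hot_unordered is hypothetical, the column order is simply the order of first appearance (an arbitrary choice), and MyObject is restated with hash(self.identity) so that the example is self-contained.

```python
class MyObject(object):
    """Same shape as the class above: hashable and comparable for equality only."""

    def __init__(self, identity):
        self.identity = identity

    def __hash__(self):
        return hash(self.identity)

    def __eq__(self, other):
        return isinstance(other, MyObject) and self.identity == other.identity


def one_hot_unordered(values):
    """One-hot-encode hashable values without ever comparing them with '<'."""
    # Column index = order of first appearance (arbitrary but reproducible).
    columns = {}
    for value in values:
        columns.setdefault(value, len(columns))

    # One row per sample, with a single 1 in the column of its value.
    rows = []
    for value in values:
        row = [0] * len(columns)
        row[columns[value]] = 1
        rows.append(row)
    return rows, columns


objects = [MyObject(0), MyObject(1), MyObject(0)]
rows, columns = one_hot_unordered(objects)
```

Here rows holds one list per sample and columns maps each distinct object to its column index, so no '<' comparison is ever needed.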

Attempted solutions

Pandas dataframe

My first attempted solution was using pandas' get_dummies() function.

import pandas as pd

objects   = [MyObject(0), MyObject(1), MyObject(0)]
dataframe = pd.DataFrame({'MyObjectFeature': objects})
dummies   = pd.get_dummies(dataframe)

However, this example gives a TypeError:

TypeError: 'values' is not ordered, please explicitly specify the categories order by passing in a categories argument.

Scikit-learn LabelEncoder & OneHotEncoder

My second attempt was to use Scikit-learn's LabelEncoder to encode the values before passing them to a OneHotEncoder. However, the LabelEncoder runs into the same problem as the Pandas approach.

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

objects = [MyObject(0), MyObject(1), MyObject(0)]
encoder = LabelEncoder()
dummies = encoder.fit_transform(objects)

This example also gives a TypeError:

TypeError: '<' not supported between instances of 'MyObject' and 'MyObject'

Custom solution

I also created my own UnorderedLabelEncoder object to encode labels without requiring an order. This works fine, but I would like to know if there is a standard solution to my problem, i.e. using well-known libraries. Or if this is not the case, I would like to know if there is a reason for requiring ordered features?

import numpy as np

class UnorderedLabelEncoder(object):

    def __init__(self):
        """ UnorderedLabelEncoder is capable of handling any
            hashable object including None values.
            """
        self.classes_ = dict()

    def fit(self, y):
        """ Fit label encoder.

            Parameters
            ----------
            y : array-like of shape (n_samples,)
                Target values.

            Returns
            -------
            self : returns an instance of self.
            """
        self.classes_ = {o:i for i, o in enumerate(set(y))}
        return self

    def fit_transform(self, y):
        """ Fit label encoder and return encoded labels.

            Parameters
            ----------
            y : array-like of shape [n_samples]
                Target values.

            Returns
            -------
            y : array-like of shape [n_samples]
        """
        self.fit(y)
        return self.transform(y)

    def transform(self, y):
        """ Transform labels to normalized encoding.

            Parameters
            ----------
            y : array-like of shape [n_samples]
                Target values.

            Returns
            -------
            y : array-like of shape [n_samples]
        """
        return np.array([self.classes_.get(x, -1) for x in y])
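
The integer labels produced by transform still have to be expanded into one-hot rows. The following is a rough sketch of that step, assuming labels shaped like the encoder's output; codes_to_one_hot is a hypothetical helper, not part of any library. Since transform returns -1 for values not seen during fit, those rows are left all-zero here, whereas indexing np.eye(n)[codes] directly would silently map -1 to the last class.

```python
import numpy as np


def codes_to_one_hot(codes, n_classes):
    """Expand integer labels into one-hot rows; label -1 (unseen) stays all-zero."""
    codes = np.asarray(codes)
    one_hot = np.zeros((len(codes), n_classes), dtype=int)
    # Only rows with a known label get a 1; negative labels are skipped.
    known = codes >= 0
    one_hot[np.flatnonzero(known), codes[known]] = 1
    return one_hot


# Labels as UnorderedLabelEncoder.transform might return them, with one unseen value.
codes = np.array([0, 1, 0, -1])
one_hot = codes_to_one_hot(codes, n_classes=2)
```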

Question

Just to reiterate: My question is, what is the best way to one-hot-encode values that have no particular order? And if there is no standardised way to do this, why should one-hot-encoded features be ordered?

Upvotes: 1

Views: 626

Answers (1)

sophros

Reputation: 16660

I would say that if the values do not have an intrinsic order (or partial order) stemming from their type, then you can define an order artificially, much like an artificial primary key in a database. That is then the order you impose on the data, and from there on you can use any method available for ordered data, as if there had been a (partial) order in the first place.
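
One possible way to do this, as a sketch rather than a definitive recipe: assign each distinct object an integer key and hand those keys to pd.get_dummies. The keys dictionary and the choice of first-appearance order as the key are assumptions made for illustration; MyObject is restated from the question so the example runs on its own.

```python
import pandas as pd


class MyObject(object):
    def __init__(self, identity):
        self.identity = identity

    def __hash__(self):
        return hash(self.identity)

    def __eq__(self, other):
        return isinstance(other, MyObject) and self.identity == other.identity


objects = [MyObject(0), MyObject(1), MyObject(0)]

# Artificial key: each distinct object gets the index of its first appearance.
keys = {}
for obj in objects:
    keys.setdefault(obj, len(keys))

# Integers are ordered, so the usual pandas machinery applies to the keys.
labels = pd.Series([keys[obj] for obj in objects])
dummies = pd.get_dummies(labels, prefix='MyObject')
```

Depending on the pandas version the dummy columns are boolean or integer; dummies.astype(int) gives 0/1 values either way.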

Upvotes: 2
