Reputation: 917
There seems to be no easy way to one-hot-encode data that has no order. My question is: what is the best way to one-hot-encode values that have no particular order? And if there is no standardised way to do this, why should one-hot-encoded features be ordered?
I am trying to one-hot-encode a set of features where the values are custom objects. My object looks like this:
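class MyObject(object):
    def __init__(self, identity):
        self.identity = identity

    def __hash__(self):
        # Objects with the same identity hash to the same value.
        return self.identity

    def __eq__(self, other):
        # Equality is defined, but no ordering (<, >) is.
        return self.identity == other.identity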
In this setting each instance of MyObject can be compared on equality. Suppose we have the following list of objects:
objects = [MyObject(0), MyObject(1), MyObject(0)]
Calling set(objects) yields a set of 2 objects, namely MyObject(0) and MyObject(1). This is indeed the behaviour that I expect. Therefore, when I try to one-hot-encode this data, I would expect something in the form of:
index  MyObject_0  MyObject_1
0      1           0
1      0           1
2      1           0
However, every solution I tried requires the values being one-hot-encoded to have some sort of order, which is undefined in my case. I think a one-hot encoding should still be possible when the order is undefined, because in that case it does not matter which one-hot-encoded feature comes before the other.
My first attempted solution was using pandas' get_dummies() function.
import pandas as pd
objects = [MyObject(0), MyObject(1), MyObject(0)]
dataframe = pd.DataFrame({'MyObjectFeature': objects})
dummies = pd.get_dummies(dataframe)
However, this example gives a TypeError:
TypeError: 'values' is not ordered, please explicitly specify the categories order by passing in a categories argument.
My second attempt was using scikit-learn's LabelEncoder to encode the values before putting them into a OneHotEncoder object. However, LabelEncoder runs into the same problem as the pandas DataFrame approach.
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
objects = [MyObject(0), MyObject(1), MyObject(0)]
encoder = LabelEncoder()
dummies = encoder.fit_transform(objects)
This example also gives a TypeError:
TypeError: '<' not supported between instances of 'MyObject' and 'MyObject'
I also created my own UnorderedLabelEncoder object to encode labels without requiring an order. This works fine, but I would like to know whether there is a standard solution to my problem, i.e. one using well-known libraries. If there is not, I would like to know whether there is a reason for requiring ordered features.
import numpy as np


class UnorderedLabelEncoder(object):
    def __init__(self):
        """ UnorderedLabelEncoder is capable of handling any
            hashable object including None values.
        """
        self.classes_ = dict()

    def fit(self, y):
        """ Fit label encoder.

        Parameters
        ----------
        y : array-like of shape (n_samples,)
            Target values.

        Returns
        -------
        self : returns an instance of self.
        """
        # set() removes duplicates; its iteration order is arbitrary,
        # so the integer assigned to each value is arbitrary too.
        self.classes_ = {o: i for i, o in enumerate(set(y))}
        return self

    def fit_transform(self, y):
        """ Fit label encoder and return encoded labels.

        Parameters
        ----------
        y : array-like of shape (n_samples,)
            Target values.

        Returns
        -------
        y : array-like of shape (n_samples,)
        """
        self.fit(y)
        return self.transform(y)

    def transform(self, y):
        """ Transform labels to normalized encoding.

        Parameters
        ----------
        y : array-like of shape (n_samples,)
            Target values.

        Returns
        -------
        y : array-like of shape (n_samples,)
        """
        # Values not seen during fit are encoded as -1.
        return np.array([self.classes_.get(x, -1) for x in y])
Just to reiterate: My question is, what is the best way to one-hot-encode values that have no particular order? And if there is no standardised way to do this, why should one-hot-encoded features be ordered?
Upvotes: 1
Views: 626
Reputation: 16660
I would say that if the values do not have an intrinsic (partial) order stemming from their type, then you can define an order artificially, similar to an artificial primary key in a database. That is then the order you impose on the data, and from there on you can use any method available for ordered data, as if there had been a (partial) order in the first place.
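As a rough illustration of the artificial-key idea, here is a minimal sketch that assigns each distinct value an integer in order of first appearance and then builds the one-hot matrix from those integers. The helper name one_hot_by_first_seen is made up for this example, and MyObject is only a stand-in for the class in the question (with a slightly more defensive __eq__); neither is a library API.

```python
import numpy as np


class MyObject:
    """Stand-in for the question's class: hashable, equality only, no ordering."""

    def __init__(self, identity):
        self.identity = identity

    def __hash__(self):
        return hash(self.identity)

    def __eq__(self, other):
        return isinstance(other, MyObject) and self.identity == other.identity


def one_hot_by_first_seen(values):
    """Hypothetical helper: one-hot-encode hashable values.

    The artificial key of a value is the position at which it first
    appears, so no '<' comparison between values is ever needed.
    """
    codes = {}
    for value in values:
        # setdefault assigns the next free integer only on first sight.
        codes.setdefault(value, len(codes))
    indices = np.array([codes[value] for value in values], dtype=int)
    # Row i of the identity matrix is the one-hot vector for key i.
    matrix = np.eye(len(codes), dtype=int)[indices]
    return matrix, codes


objects = [MyObject(0), MyObject(1), MyObject(0)]
matrix, codes = one_hot_by_first_seen(objects)
print(matrix)
# [[1 0]
#  [0 1]
#  [1 0]]
```

The column order here is simply whatever order the values were first seen in, which is one arbitrary choice among many; any other stable assignment of keys would give an equally valid encoding, only with the columns permuted.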
Upvotes: 2