Reputation: 3539

Scikit: Convert one-hot encoding to encoding with integers

I need to convert a one-hot encoding to categories represented by unique integers. Take a one-hot encoding created with the following code:

from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
labels = [[1],[2],[3]]
enc.fit(labels)  
for x in [1,2,3]:
    print(enc.transform([[x]]).toarray())

Out:
[[ 1.  0.  0.]]
[[ 0.  1.  0.]]
[[ 0.  0.  1.]]

This output should be converted back to a set of unique integers, for example:

[1, 2, 3] or [11, 37, 45], or any other set in which each integer uniquely represents a single class.

Is this possible with scikit-learn or any other Python library?

* Update *

I tried the following:

labels = [[1],[2],[3], [4], [5],[6],[7]]
enc.fit(labels) 

lst = []
for x in [1,2,3,4,5,6,7]:
    lst.append(enc.transform([[x]]).toarray())
lst
Out:
[array([[ 1.,  0.,  0.,  0.,  0.,  0.,  0.]]),
 array([[ 0.,  1.,  0.,  0.,  0.,  0.,  0.]]),
 array([[ 0.,  0.,  1.,  0.,  0.,  0.,  0.]]),
 array([[ 0.,  0.,  0.,  1.,  0.,  0.,  0.]]),
 array([[ 0.,  0.,  0.,  0.,  1.,  0.,  0.]]),
 array([[ 0.,  0.,  0.,  0.,  0.,  1.,  0.]]),
 array([[ 0.,  0.,  0.,  0.,  0.,  0.,  1.]])]


import numpy as np
a = np.array(lst)
np.where(a==1)[1]
Out:
array([0, 0, 0, 0, 0, 0, 0], dtype=int64)

This is not what I need.

Upvotes: 6

Answers (3)

KamKam

Reputation: 564

Since you are using sklearn.preprocessing.OneHotEncoder to 'encode' the data, you can use its .inverse_transform() method to 'decode' the data (I think this requires scikit-learn 0.20 or newer):

from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
labels = [[1],[2],[3]]
encoder = enc.fit(labels)
encoded_labels = encoder.transform(labels)
decoded_labels = encoder.inverse_transform(encoded_labels)
decoded_labels
# array([[1],
#        [2],
#        [3]])

N.B. decoded_labels is a NumPy array, not a list.
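
If a flat Python list is wanted instead, something along these lines should work. This is only a sketch: the labels [11, 37, 45] and the name flat_labels are made up for illustration, and it assumes a scikit-learn version in which OneHotEncoder has inverse_transform.

```python
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
labels = [[11], [37], [45]]  # illustrative labels, not from the question

# fit_transform returns a sparse matrix with one column per class
encoded = enc.fit_transform(labels)

# inverse_transform maps each one-hot row back to its original label,
# giving a 2-D NumPy array of shape (n_samples, 1)
decoded = enc.inverse_transform(encoded)

# Flatten the column and convert it to a plain Python list
flat_labels = decoded.ravel().tolist()
print(flat_labels)
```

Because the encoder remembers the classes it was fitted on, this recovers the original integers rather than zero-based column positions.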

Source: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder.inverse_transform

Upvotes: 0

ml4294

Reputation: 2629

You can use np.argmax():

from sklearn.preprocessing import OneHotEncoder
import numpy as np

enc = OneHotEncoder()
labels = [[1],[2],[3]]
enc.fit(labels)  
x = enc.transform(labels).toarray()


# x = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
xr = (np.argmax(x, axis=1)+1).reshape(-1, 1)
print(xr)

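This should give array([[1], [2], [3]]). If you want array([[0], [1], [2]]) instead, just remove the +1 in the definition of xr. Note that the +1 recovers the original labels only because they are the consecutive integers starting at 1.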
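
For labels that are not consecutive, one option is to use the argmax result as an index into the array of class labels. A minimal sketch, assuming the classes are known; the categories array here is hypothetical, and with a fitted OneHotEncoder on a single feature it would correspond to enc.categories_[0]:

```python
import numpy as np

# Hypothetical class labels, in the same order as the one-hot columns
categories = np.array([11, 37, 45])

one_hot = np.array([[0., 1., 0.],
                    [1., 0., 0.],
                    [0., 0., 1.]])

# Column position of the 1 in every row
column_idx = np.argmax(one_hot, axis=1)

# Look up the label that belongs to each column position
decoded = categories[column_idx]
print(decoded)
```

This avoids hard-coding an offset such as +1.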

Upvotes: 2

Miriam Farber

Reputation: 19664

You can do that using np.where as follows:

import numpy as np
a=np.array([[ 0.,  1.,  0.],
            [ 1.,  0.,  0.],
            [ 0.,  0.,  1.]])
np.where(a==1)[1]

This returns array([1, 0, 2], dtype=int64). It works because np.where(a==1)[1] returns the column indices of the 1's, which are exactly the (zero-based) labels.

In addition, since a is a 0,1-matrix, you can also replace np.where(a==1)[1] with just np.where(a)[1].
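
To make the indexing explicit: called with only a condition, np.where returns one index array per axis, so [1] picks the column indices. The sketch below also adds a sanity check that every row contains exactly one 1, which is an assumption this approach relies on (the check is an addition for illustration, not part of the original solution):

```python
import numpy as np

a = np.array([[0., 1., 0.],
              [1., 0., 0.],
              [0., 0., 1.]])

# One index array per axis: row positions and column positions of the 1's
rows, cols = np.where(a == 1)

# The column indices are the labels only if each row is a valid one-hot
# vector, i.e. contains exactly one 1
is_one_hot = bool((a.sum(axis=1) == 1).all())

print(rows, cols, is_one_hot)
```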

Update: The following solution should work with your format:

l=[np.array([[ 1.,  0.,  0.,  0.,  0.,  0.,  0.]]),
 np.array([[ 0.,  0.,  1.,  0.,  0.,  0.,  0.]]),
 np.array([[ 0.,  1.,  0.,  0.,  0.,  0.,  0.]]),
 np.array([[ 0.,  0.,  0.,  0.,  1.,  0.,  0.]]),
 np.array([[ 0.,  0.,  0.,  0.,  1.,  0.,  0.]]),
 np.array([[ 0.,  0.,  0.,  0.,  0.,  1.,  0.]]),
 np.array([[ 0.,  0.,  0.,  0.,  0.,  0.,  1.]])]
a=np.array(l)

np.where(a)[2]

This prints

array([0, 2, 1, 4, 4, 5, 6], dtype=int64)

Alternatively, you could use the original solution together with @ml4294's comment.
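
The extra index is needed because a list of (1, n) arrays becomes a 3-D array, so axis 1 has length one and np.where(...)[1] is all zeros. Another option, sketched here with a shortened made-up list, is to stack the rows into a 2-D array with np.vstack first and then apply np.argmax:

```python
import numpy as np

# Shortened, illustrative version of the list-of-arrays format
lst = [np.array([[1., 0., 0.]]),
       np.array([[0., 0., 1.]]),
       np.array([[0., 1., 0.]])]

stacked = np.array(lst)   # 3-D: one (1, 3) array per sample
flat = np.vstack(lst)     # 2-D: one row per sample

# With the 2-D array, the column of the 1 in each row is the label
labels = np.argmax(flat, axis=1)

print(stacked.shape, flat.shape, labels)
```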

Upvotes: 8
