Mpizos Dimitris

Reputation: 5001

Python: One-hot encoding for huge data

I keep running into memory errors trying to one-hot encode string labels. There are around 5 million rows and around 10,000 distinct labels. I have tried the following, but keep getting memory errors:

from sklearn import preprocessing
lb = preprocessing.LabelBinarizer()
label_fitter = lb.fit(y)
y = label_fitter.transform(y)

I also tried something like this:

import numpy as np

def one_hot_encoding(y):
    unique_values = set(y)
    label_length = len(unique_values)
    enu_uniq = zip(unique_values , range(len(unique_values)))
    dict1 = dict(enu_uniq)
    values = []
    for i in y:
        temp = np.zeros((label_length,), dtype="float32")
        if i in dict1:
            temp[dict1[i]] = 1.0
        values.append(temp)
    return np.array(values)

Still getting memory errors. Any tips? A few people have asked the same thing here on Stack Overflow, but none of the answers seem useful.

Upvotes: 4

Views: 6497

Answers (2)

Brecht Machiels

Reputation: 3410

This may not have been available at the time the question was asked, but LabelBinarizer takes a sparse_output argument.

from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer(sparse_output=True)
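A small sketch of how this looks end to end (the toy labels here are made up for illustration):

```python
from sklearn.preprocessing import LabelBinarizer
from scipy.sparse import issparse

y = ["cat", "dog", "cat", "bird"]
lb = LabelBinarizer(sparse_output=True)
encoded = lb.fit_transform(y)  # a SciPy sparse matrix instead of a dense ndarray

print(issparse(encoded))  # True
print(encoded.shape)      # (4, 3): one column per unique label
```

With 5M rows and 10K labels, the sparse result stores only the 5M nonzero entries rather than a 50-billion-element dense array.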

Upvotes: 4

Imanol Luengo

Reputation: 15909

Your main problem seems to be that the binarized y doesn't fit into memory. You can work with sparse matrices to avoid this.

>>> import numpy as np
>>> from scipy.sparse import csc_matrix
>>> y = np.random.randint(0, 10000, size=5000000) # 5M random integers [0,10K)

You can transform those labels y to a 5M x 10K sparse matrix as follows:

>>> dtype = np.uint8 # change to bool (or another dtype) if you prefer
>>> rows = np.arange(y.size) # each of the elements of `y` is a row itself
>>> cols = y # `y` indicates the column that is going to be flagged
>>> data = np.ones(y.size, dtype=dtype) # Set to `1` each (row,column) pair
>>> ynew = csc_matrix((data, (rows, cols)), shape=(y.size, y.max()+1), dtype=dtype)

ynew is then a sparse matrix where each row is full of zeros except one entry:

>>> ynew
<5000000x10000 sparse matrix of type '<type 'numpy.uint8'>'
     with 5000000 stored elements in Compressed Sparse Column format>

You will have to adapt your code to work with sparse matrices, but it is probably your best option. You can also recover full rows or columns from the sparse matrix:

>>> row0 = ynew[0].toarray() # row0 is a standard numpy array
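Note that csc_matrix is column-oriented; if you mostly slice rows, converting to CSR may be worthwhile. A small sketch of the same construction on toy data (sizes shrunk for illustration):

```python
import numpy as np
from scipy.sparse import csc_matrix

# Toy version of the construction above: 100 labels in [0, 10)
y = np.random.randint(0, 10, size=100)
rows = np.arange(y.size)
data = np.ones(y.size, dtype=np.uint8)
ynew = csc_matrix((data, (rows, y)), shape=(y.size, y.max() + 1), dtype=np.uint8)

ycsr = ynew.tocsr()               # CSR format slices rows more efficiently
row0 = ycsr[0].toarray().ravel()  # dense 1-D copy of the first row
print(row0.sum())                 # 1: exactly one flagged entry per row
```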

For string labels or labels of arbitrary data type:

>>> y = ['aaa' + str(i) for i in np.random.randint(0, 10000, size=5000000)] # e.g. 'aaa9937'

First extract a mapping from labels to integers:

>>> labels = np.unique(y) # List of unique labels
>>> mapping = {u:i for i,u in enumerate(labels)}
>>> inv_mapping = {i:u for i,u in enumerate(labels)} # Only needed if you want to recover original labels at some point

The mapping assigns each label an integer, based on the (sorted) order in which np.unique stores them in labels.

And then create the sparse matrix again:

>>> N, M = len(y), labels.size
>>> dtype = np.uint8 # change to bool if you want boolean
>>> rows = np.arange(N)
>>> cols = [mapping[i] for i in y]
>>> data = np.ones(N, dtype=dtype)
>>> ynew = csc_matrix((data, (rows, cols)), shape=(N, M), dtype=dtype)

You can also build the inverse mapping (not strictly needed) if you later want to know which original label a given integer column corresponds to:

>>> inv_mapping = {i:u for i,u in enumerate(labels)}
>>> inv_mapping[10] # ---> something like 'aaaXXX'
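As a sanity check, here is the whole string-label construction repeated on toy data, mapping a one-hot row back to its original label via inv_mapping (the labels are made up for illustration):

```python
import numpy as np
from scipy.sparse import csc_matrix

# Small reproduction of the setup above with toy data
y = ['aaa3', 'aaa1', 'aaa3', 'aaa2']
labels = np.unique(y)
mapping = {u: i for i, u in enumerate(labels)}
inv_mapping = {i: u for i, u in enumerate(labels)}

N, M = len(y), labels.size
rows = np.arange(N)
cols = [mapping[i] for i in y]
data = np.ones(N, dtype=np.uint8)
ynew = csc_matrix((data, (rows, cols)), shape=(N, M), dtype=np.uint8)

# Recover the label of row 0: find the flagged column, then map it back
col = int(ynew[0].toarray().argmax())
print(inv_mapping[col])  # 'aaa3'
```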

Upvotes: 4
