Mpizos Dimitris

Reputation: 5001

Python: One-hot encoding for huge data

I keep running into memory errors trying to one-hot encode string labels. There are around 5 million rows and around 10,000 distinct labels. I have tried the following, but keep getting memory errors:

from sklearn import preprocessing
lb = preprocessing.LabelBinarizer()
label_fitter = lb.fit(y)
y = label_fitter.transform(y)

I also tried something like this:

import numpy as np

def one_hot_encoding(y):
    unique_values = set(y)
    label_length = len(unique_values)
    enu_uniq = zip(unique_values , range(len(unique_values)))
    dict1 = dict(enu_uniq)
    values = []
    for i in y:
        temp = np.zeros((label_length,), dtype="float32")
        if i in dict1:
            temp[dict1[i]] = 1.0
        values.append(temp)
    return np.array(values)

Still getting memory errors. Any tips? A few people have asked the same thing here on Stack Overflow, but none of the answers seem useful.

Upvotes: 4

Views: 6497

Answers (2)

Brecht Machiels

Reputation: 3410

This may not have been available at the time the question was asked, but LabelBinarizer takes a sparse_output argument.

from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer(sparse_output=True)
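A small sketch of how this looks end to end (the toy labels here are made up for illustration):

```python
from sklearn.preprocessing import LabelBinarizer
from scipy.sparse import issparse

y = ["cat", "dog", "cat", "bird"]
lb = LabelBinarizer(sparse_output=True)
encoded = lb.fit_transform(y)  # a SciPy sparse matrix instead of a dense ndarray

print(issparse(encoded))  # True
print(encoded.shape)      # (4, 3): one column per unique label
```

With 5M rows and 10K labels, the sparse result stores only the 5M nonzero entries rather than a 50-billion-element dense array.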

Upvotes: 4

Imanol Luengo

Reputation: 15909

Your main problem seems to be that the binarized y doesn't fit into memory. You can work with sparse matrices to avoid this.

>>> import numpy as np
>>> from scipy.sparse import csc_matrix
>>> y = np.random.randint(0, 10000, size=5000000) # 5M random integers [0,10K)

You can transform those labels y to a 5M x 10K sparse matrix as follows:

>>> dtype = np.uint8 # change to bool (or another dtype) if you prefer
>>> rows = np.arange(y.size) # each of the elements of `y` is a row itself
>>> cols = y # `y` indicates the column that is going to be flagged
>>> data = np.ones(y.size, dtype=dtype) # Set to `1` each (row,column) pair
>>> ynew = csc_matrix((data, (rows, cols)), shape=(y.size, y.max()+1), dtype=dtype)

ynew is then a sparse matrix where each row is full of zeros except one entry:

>>> ynew
<5000000x10000 sparse matrix of type '<type 'numpy.uint8'>'
     with 5000000 stored elements in Compressed Sparse Column format>

You will have to adapt your code to work with sparse matrices, but it is probably your best option. You can also recover full rows or columns from the sparse matrix:

>>> row0 = ynew[0].toarray() # row0 is a standard numpy array
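Note that csc_matrix is column-oriented; if you mostly slice rows, converting to CSR may be worthwhile. A small sketch of the same construction on toy data (sizes shrunk for illustration):

```python
import numpy as np
from scipy.sparse import csc_matrix

# Toy version of the construction above: 100 labels in [0, 10)
y = np.random.randint(0, 10, size=100)
rows = np.arange(y.size)
data = np.ones(y.size, dtype=np.uint8)
ynew = csc_matrix((data, (rows, y)), shape=(y.size, y.max() + 1), dtype=np.uint8)

ycsr = ynew.tocsr()               # CSR format slices rows more efficiently
row0 = ycsr[0].toarray().ravel()  # dense 1-D copy of the first row
print(row0.sum())                 # 1: exactly one flagged entry per row
```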

For string labels or labels of arbitrary data type:

>>> y = ['aaa' + str(i) for i in np.random.randint(0, 10000, size=5000000)] # e.g. 'aaa9937'

First extract a mapping from labels to integers:

>>> labels = np.unique(y) # List of unique labels
>>> mapping = {u:i for i,u in enumerate(labels)}
>>> inv_mapping = {i:u for i,u in enumerate(labels)} # Only needed if you want to recover original labels at some point

The mapping assigns each label an integer, based on the (sorted) order in which np.unique stores them in labels.

And then create the sparse matrix again:

>>> N, M = len(y), labels.size
>>> dtype = np.uint8 # change to bool if you want boolean
>>> rows = np.arange(N)
>>> cols = [mapping[i] for i in y]
>>> data = np.ones(N, dtype=dtype)
>>> ynew = csc_matrix((data, (rows, cols)), shape=(N, M), dtype=dtype)

You can also build the inverse mapping (not strictly needed) if you later want to know which original label a given integer column corresponds to:

>>> inv_mapping = {i:u for i,u in enumerate(labels)}
>>> inv_mapping[10] # ---> something like 'aaaXXX'
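As a sanity check, here is the whole string-label construction repeated on toy data, mapping a one-hot row back to its original label via inv_mapping (the labels are made up for illustration):

```python
import numpy as np
from scipy.sparse import csc_matrix

# Small reproduction of the setup above with toy data
y = ['aaa3', 'aaa1', 'aaa3', 'aaa2']
labels = np.unique(y)
mapping = {u: i for i, u in enumerate(labels)}
inv_mapping = {i: u for i, u in enumerate(labels)}

N, M = len(y), labels.size
rows = np.arange(N)
cols = [mapping[i] for i in y]
data = np.ones(N, dtype=np.uint8)
ynew = csc_matrix((data, (rows, cols)), shape=(N, M), dtype=np.uint8)

# Recover the label of row 0: find the flagged column, then map it back
col = int(ynew[0].toarray().argmax())
print(inv_mapping[col])  # 'aaa3'
```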

Upvotes: 4
