Reputation: 5001
I keep running into memory errors when trying to one-hot encode string labels. There are around 5 million rows and around 10,000 distinct labels. I have tried the following, but it fails with a memory error:
from sklearn import preprocessing
lb = preprocessing.LabelBinarizer()
label_fitter = lb.fit(y)
y = label_fitter.transform(y)
I also tried something like this:
import numpy as np
def one_hot_encoding(y):
    unique_values = set(y)
    label_length = len(unique_values)
    enu_uniq = zip(unique_values, range(len(unique_values)))
    dict1 = dict(enu_uniq)
    values = []
    for i in y:
        temp = np.zeros((label_length,), dtype="float32")
        if i in dict1:
            temp[dict1[i]] = 1.0
        values.append(temp)
    return np.array(values)
I still get memory errors. Any tips? Other people have asked the same thing here on Stack Overflow, but none of the answers seem useful.
Upvotes: 4
Views: 6497
Reputation: 3410
This may not have been available at the time the question was asked, but LabelBinarizer takes a sparse_output argument. With sparse_output=True, transform returns a SciPy sparse matrix instead of a dense array:
from sklearn.preprocessing import LabelBinarizer
lb = LabelBinarizer(sparse_output=True)
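For completeness, here is a minimal sketch of how that binarizer might be used end to end. The toy labels are made up for illustration; with the real data, y would be the 5M-element label column.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import LabelBinarizer

# Toy labels standing in for the real 5M-row column (hypothetical data).
y = np.array(["cat", "dog", "bird", "dog", "cat"])

lb = LabelBinarizer(sparse_output=True)
y_sparse = lb.fit_transform(y)  # SciPy sparse matrix, one column per class

print(sparse.issparse(y_sparse), y_sparse.shape)
print(lb.classes_)  # column order of the encoding

# Map the sparse encoding back to the original string labels.
y_back = lb.inverse_transform(y_sparse)
```

Only the non-zero entries are stored, so memory use should grow with the number of rows rather than with rows times labels.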
Upvotes: 4
Reputation: 15909
Your main problem seems to be that the binarized y doesn't fit into memory. You can work with sparse matrices to avoid this.
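To get a feel for the scale, here is a back-of-the-envelope estimate for the sizes in the question. The sparse figure assumes 1-byte values and 4-byte (int32) indices, which is an assumption on my part; the index width SciPy actually picks may differ.

```python
n_rows, n_labels = 5_000_000, 10_000

# Dense one-hot matrix: every cell is stored.
dense_float32_bytes = n_rows * n_labels * 4  # 4 bytes per float32
dense_uint8_bytes = n_rows * n_labels * 1    # 1 byte per uint8

# Sparse one-hot matrix: one stored value per row plus its index,
# and a column-pointer array of length n_labels + 1.
# Assumes 1-byte values and 4-byte indices (an assumption, see above).
sparse_bytes = n_rows * (1 + 4) + (n_labels + 1) * 4

print(f"dense float32: {dense_float32_bytes / 1e9:.0f} GB")
print(f"dense uint8:   {dense_uint8_bytes / 1e9:.0f} GB")
print(f"sparse:        {sparse_bytes / 1e6:.0f} MB")
```

Under those assumptions the dense float32 array needs about 200 GB, while the sparse version needs roughly 25 MB.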
>>> import numpy as np
>>> from scipy.sparse import csc_matrix
>>> y = np.random.randint(0, 10000, size=5000000) # 5M random integers [0,10K)
You can transform those labels y to a 5M x 10K sparse matrix as follows:
>>> dtype = np.uint8 # change to np.bool_ if you want boolean, or to any other data type
>>> rows = np.arange(y.size) # each of the elements of `y` is a row itself
>>> cols = y # `y` indicates the column that is going to be flagged
>>> data = np.ones(y.size, dtype=dtype) # Set to `1` each (row,column) pair
>>> ynew = csc_matrix((data, (rows, cols)), shape=(y.size, y.max()+1), dtype=dtype)
ynew is then a sparse matrix where each row is all zeros except for a single entry:
>>> ynew
<5000000x10000 sparse matrix of type '<type 'numpy.uint8'>'
with 5000000 stored elements in Compressed Sparse Column format>
You will have to adapt your code to work with sparse matrices, but this is probably the best option you have. You can also recover full rows or columns from the sparse matrix as dense arrays:
>>> row0 = ynew[0].toarray() # row0 is a standard numpy array
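As a rough sketch of what "adapting your code" could look like, the snippet below walks over the matrix in small row batches and densifies only one batch at a time. The tiny matrix and batch_size are placeholders I made up; the real consumer of each batch depends on your model.

```python
import numpy as np
from scipy.sparse import csc_matrix

# Small stand-in for the 5M x 10K matrix (hypothetical sizes).
y = np.array([2, 0, 3, 1, 2, 0])
data = np.ones(y.size, dtype=np.uint8)
ynew = csc_matrix((data, (np.arange(y.size), y)), shape=(y.size, 4))

# CSR is usually the better layout for slicing rows; CSC favours columns.
ycsr = ynew.tocsr()

batch_size = 2  # hypothetical; pick whatever fits in memory
for start in range(0, ycsr.shape[0], batch_size):
    batch = ycsr[start:start + batch_size].toarray()  # small dense block
    # ... hand `batch` to the model here ...

# Per-class counts, computed without building the dense matrix.
counts = np.asarray(ynew.sum(axis=0)).ravel()
```

Many scikit-learn estimators also accept sparse input directly, in which case no densifying is needed at all.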
For string labels or labels of arbitrary data type:
>>> y = ['aaa' + str(i) for i in np.random.randint(0, 10000, size=5000000)] # e.g. 'aaa9937'
First extract a mapping from labels to integers:
>>> labels = np.unique(y) # List of unique labels
>>> mapping = {u:i for i,u in enumerate(labels)}
>>> inv_mapping = {i:u for i,u in enumerate(labels)} # Only needed if you want to recover original labels at some point
The above mapping maps each label to an integer (its position in labels, the sorted array of unique labels).
And then create the sparse matrix again:
>>> N, M = len(y), labels.size
>>> dtype = np.uint8 # change to np.bool_ if you want boolean
>>> rows = np.arange(N)
>>> cols = [mapping[i] for i in y]
>>> data = np.ones(N, dtype=dtype)
>>> ynew = csc_matrix((data, (rows, cols)), shape=(N, M), dtype=dtype)
You can also create the inverse mapping (although it is not needed) if you later want to know which original label a given column index corresponds to:
>>> inv_mapping = {i:u for i,u in enumerate(labels)}
>>> inv_mapping[10] # ---> something like 'aaaXXX'
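As a possible shortcut, np.unique can return the integer codes directly via return_inverse=True, which avoids building the dictionary and looping over y in Python. This is a sketch on toy data rather than a drop-in replacement; csr_matrix is used here only because row access is the more common need.

```python
import numpy as np
from scipy.sparse import csr_matrix

y = np.array(["aaa7", "aaa2", "aaa7", "aaa9", "aaa2"])  # toy string labels

# labels: sorted unique values; codes[i]: position of y[i] within labels.
labels, codes = np.unique(y, return_inverse=True)

N, M = y.size, labels.size
data = np.ones(N, dtype=np.uint8)
ynew = csr_matrix((data, (np.arange(N), codes)), shape=(N, M))

# labels doubles as the inverse mapping: index it with column numbers.
recovered = labels[codes]
```

Here labels[k] plays the role of inv_mapping[k], so no separate dictionary is required.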
Upvotes: 4