Reputation: 62
Trying to work on multi-label classification for a huge data set. I have around 4,000 unique labels, and when I try to use LabelBinarizer().fit(yTuple)/transform, it just takes forever. Given the number of labels and the number of rows (6 million), is this normal, or am I doing something wrong?
Laptop config: Mac, i5 quad-core, 16 GB RAM, with enough hard disk space left (around 250 GB free)
The code is simple, but pasting it here anyway:
from sklearn.preprocessing import LabelBinarizer

# split each row's space-separated labels into a tuple of labels
yTuple = [tuple(item.split(' ')) for item in getY(filepath)]
lb = LabelBinarizer().fit(yTuple)
Y_indicator = lb.transform(yTuple)
getY(filepath) - this returns the label set for one row at a time.
Upvotes: 1
Views: 480
Reputation: 832
Provided you have enough memory, you might want to try pandas get_dummies instead.
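For multi-label rows stored as space-separated strings (the format in the question), the string-accessor variant Series.str.get_dummies is the closer fit, since plain pd.get_dummies treats each whole string as a single category. A minimal sketch; the sample rows are hypothetical stand-ins for what getY(filepath) yields:

import pandas as pd

# hypothetical sample rows: each entry is a space-separated set of labels
rows = ["sports football", "politics", "sports tennis politics"]

# str.get_dummies splits on the separator and builds one 0/1
# indicator column per unique label
Y_indicator = pd.Series(rows).str.get_dummies(sep=' ')
print(Y_indicator)
#    football  politics  sports  tennis
# 0         1         0       1       0
# 1         0         1       0       0
# 2         0         1       1       1

Note the result is still a dense frame, so the memory caveat in the answer applies at 6 million rows x 4,000 labels.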
Upvotes: 1
Reputation: 363818
The label array will take roughly 4000 * 6e6 * 8 bytes, which is about 179 GB. scikit-learn isn't up to such massively multiclass classification out of the box.
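(That is the cost of a dense float64 indicator matrix: 4000 * 6e6 * 8 = 1.92e11 bytes.) One way around it, not from the answer but a sketch using scikit-learn's MultiLabelBinarizer, is to request sparse output, which stores only the nonzero (row, label) entries; with a handful of labels per row that is on the order of millions of entries rather than 24 billion:

from sklearn.preprocessing import MultiLabelBinarizer

# hypothetical stand-in for the question's yTuple: one tuple of labels per row
yTuple = [("sports", "football"), ("politics",), ("sports", "tennis")]

# sparse_output=True yields a scipy CSR matrix holding only the nonzero
# entries instead of a dense n_rows x n_labels array
mlb = MultiLabelBinarizer(sparse_output=True)
Y_indicator = mlb.fit_transform(yTuple)

print(Y_indicator.shape)   # (3, 4)
print(mlb.classes_)        # ['football' 'politics' 'sports' 'tennis']

Whether the downstream estimator accepts a sparse multilabel target is a separate question; many scikit-learn classifiers do not.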
Upvotes: 2