Moon

Reputation: 62

LabelBinarizer() Takes forever

I'm working on multi-label classification for a huge data set: around 4000 unique labels and 6 million rows. When I call LabelBinarizer().fit(yTuple) and then transform, it just takes forever. Given the number of labels and rows, is that normal, or am I doing something wrong?

Laptop config: Mac, i5 quad core, 16 GB RAM, enough hard disk space left (around 250 GB free).

Code is simple but still pasting it here:

yTuple = [tuple(item.split(' ')) for item in getY(filepath)]
lb = LabelBinarizer().fit(yTuple)
Y_indicator = lb.transform(yTuple)

getY(filepath) - this returns the space-separated label string for one row at a time.
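For what it's worth, scikit-learn also ships MultiLabelBinarizer, which is designed for exactly this tuple-of-labels input and supports sparse_output=True so the indicator matrix stays a scipy CSR matrix instead of a dense array. A minimal sketch with stand-in data (the tuples below are hypothetical, not your real labels):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Small stand-in for the real data: each row's labels, already split on spaces.
yTuple = [("a", "b"), ("b",), ("c", "a")]

# sparse_output=True keeps the indicator matrix sparse (scipy CSR) instead of
# a dense ndarray, which matters once rows x unique-labels gets large.
mlb = MultiLabelBinarizer(sparse_output=True)
Y_indicator = mlb.fit_transform(yTuple)

print(Y_indicator.shape)   # (3, 3): 3 rows, 3 unique labels
print(list(mlb.classes_))  # ['a', 'b', 'c']
```

Whether this also fixes the runtime depends on your data, but it at least avoids materializing a 6M x 4000 dense matrix.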

Upvotes: 1

Views: 480

Answers (2)

Diego

Reputation: 832

Provided you have enough memory, you might want to try pandas get_dummies instead.
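Since the question's labels are space-separated strings, one way to apply this idea is pandas' Series.str.get_dummies, which splits on a separator and builds one 0/1 column per unique label. A sketch with hypothetical stand-in data:

```python
import pandas as pd

# Hypothetical stand-in for the raw per-row label strings (space-separated).
labels = pd.Series(["a b", "b", "c a"])

# str.get_dummies splits each string on the separator and creates one
# indicator column per unique label, sorted alphabetically.
dummies = labels.str.get_dummies(sep=" ")
print(dummies)
```

Note the result is a dense DataFrame, so the memory caveat above still applies at 6 million rows and 4000 labels.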

Upvotes: 1

Fred Foo

Reputation: 363818

A dense label indicator array will take roughly 4000 × 6e6 × 8 bytes, which is about 179 GiB. scikit-learn isn't up to such massively multiclass classification out of the box.

Upvotes: 2
