Reputation: 62
Trying to work on multi-label classification for a huge data set. I have around 4,000 unique labels, and when I try to use LabelBinarizer().fit(yTuple)/transform, it just takes forever. Given the number of labels and the number of rows (6 million), is this normal, or am I doing something wrong?
Laptop config: Mac, i5 quad-core, 16 GB RAM, with enough hard disk space left (around 250 GB free)
The code is simple, but pasting it here anyway:
from sklearn.preprocessing import LabelBinarizer

# split each row's space-separated labels into a tuple of labels
yTuple = [tuple(item.split(' ')) for item in getY(filepath)]
lb = LabelBinarizer().fit(yTuple)
Y_indicator = lb.transform(yTuple)
getY(filepath) - this returns the label set for one row at a time.
Upvotes: 1
Views: 480
Reputation: 832
Provided you have enough memory, you might want to try pandas get_dummies instead.
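For multi-label rows stored as space-separated strings (the format in the question), the string-accessor variant Series.str.get_dummies is the closer fit, since plain pd.get_dummies treats each whole string as a single category. A minimal sketch; the sample rows are hypothetical stand-ins for what getY(filepath) yields:

import pandas as pd

# hypothetical sample rows: each entry is a space-separated set of labels
rows = ["sports football", "politics", "sports tennis politics"]

# str.get_dummies splits on the separator and builds one 0/1
# indicator column per unique label
Y_indicator = pd.Series(rows).str.get_dummies(sep=' ')
print(Y_indicator)
#    football  politics  sports  tennis
# 0         1         0       1       0
# 1         0         1       0       0
# 2         0         1       1       1

Note the result is still a dense frame, so the memory caveat in the answer applies at 6 million rows x 4,000 labels.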
Upvotes: 1
Reputation: 363818
The label array will take roughly 4000 * 6e6 * 8 bytes, which is about 179 GB. scikit-learn isn't up to such massively multiclass classification out of the box.
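(That is the cost of a dense float64 indicator matrix: 4000 * 6e6 * 8 = 1.92e11 bytes.) One way around it, not from the answer but a sketch using scikit-learn's MultiLabelBinarizer, is to request sparse output, which stores only the nonzero (row, label) entries; with a handful of labels per row that is on the order of millions of entries rather than 24 billion:

from sklearn.preprocessing import MultiLabelBinarizer

# hypothetical stand-in for the question's yTuple: one tuple of labels per row
yTuple = [("sports", "football"), ("politics",), ("sports", "tennis")]

# sparse_output=True yields a scipy CSR matrix holding only the nonzero
# entries instead of a dense n_rows x n_labels array
mlb = MultiLabelBinarizer(sparse_output=True)
Y_indicator = mlb.fit_transform(yTuple)

print(Y_indicator.shape)   # (3, 4)
print(mlb.classes_)        # ['football' 'politics' 'sports' 'tennis']

Whether the downstream estimator accepts a sparse multilabel target is a separate question; many scikit-learn classifiers do not.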
Upvotes: 2