Reputation: 497
I realize there's another question with a similar title, but my dataset is very different.
I have nearly 40 million rows and about 3 thousand labels. Running a simple sklearn train_test_split takes nearly 20 minutes.
I initially used multi-class classification models, as that's all I had experience with, but then realized that since I need to predict all of the possible labels a particular record could be tied to, I should be using a multi-label classification method.
I'm looking for recommendations on how to do this efficiently. I tried binary relevance, which took nearly 4 hours to train. Classifier chains errored out with a memory error after 22 hours. I'm hesitant to try label powerset, as I've read it doesn't work well with this much data. That leaves the adapted algorithms such as MLkNN, and then the ensemble approaches (which I'm also worried about performance-wise).
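For reference, by binary relevance I mean a one-classifier-per-label setup along these lines (a minimal sketch; `X_train`, `y_train`, and `X_test` are placeholders for my sparse feature matrix, binary label-indicator matrix, and held-out features):

```python
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier

# Binary relevance = one independent binary classifier per label.
# X_train: sparse matrix of shape (n_samples, n_features)
# y_train: binary indicator matrix of shape (n_samples, n_labels)
clf = OneVsRestClassifier(
    SGDClassifier(max_iter=5, tol=1e-3),  # fast linear model per label
    n_jobs=-1,                            # fit the ~3,000 models in parallel
)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
```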
Does anyone else have experience with this type of problem at this volume of data? In addition to suggested models, I'm also hoping for advice on training methodology, such as train/test split ratios or different/better splitting methods.
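For concreteness, an index-based split is the kind of alternative I have in mind, where I shuffle the row indices once and reuse them across experiments instead of re-running train_test_split each time (a sketch; `X` and `y` are placeholders for my data):

```python
import numpy as np

# Shuffle row indices once and reuse them across experiments,
# rather than re-splitting the full 40M-row dataset every time.
rng = np.random.default_rng(0)
indices = rng.permutation(X.shape[0])
cut = int(0.9 * len(indices))            # 90/10 split; adjust as needed
train_idx, test_idx = indices[:cut], indices[cut:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
```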
Upvotes: 2
Views: 710
Reputation: 366
20 minutes for a job of this size doesn't seem that long, and neither does 4 hours for training.
I would really try Vowpal Wabbit. It excels at this sort of multi-label problem and will probably give unmatched performance if that's what you're after. It requires significant tuning and will still require quality training data, but it's well worth it. Under a one-against-all reduction, this is essentially just a set of binary classification problems, one per label. An ensemble will of course take longer, so consider whether or not it's necessary given your accuracy requirements.
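As a rough sketch of what that looks like (this uses VW's multilabel one-against-all reduction; flag names and label indexing can vary by version, and the feature names here are made up):

```python
# Sketch: write training data in VW's multilabel text format, then train
# with the multilabel one-against-all reduction, e.g.:
#   vw train.vw --multilabel_oaa 3000 -b 28 --passes 3 --cache_file vw.cache -f model.vw
# (Flags are illustrative; check `vw --help` for your version.)

def to_vw_line(label_ids, features):
    """label_ids: iterable of int label ids; features: dict of name -> value."""
    labels = ",".join(str(i) for i in label_ids)   # comma-separated label ids
    feats = " ".join(f"{name}:{value}" for name, value in features.items())
    return f"{labels} | {feats}"

# `rows` is a placeholder for your (label_ids, feature_dict) pairs.
with open("train.vw", "w") as f:
    for label_ids, features in rows:
        f.write(to_vw_line(label_ids, features) + "\n")
```

Because VW learns online and streams examples from disk, it never needs to hold all 40 million rows in memory at once.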
Upvotes: 2