Reputation: 13850
Looking at Kaggle's Job Salary Prediction, I see numeric features (like Category) and textual ones (like FullDescription).
How do I go about training on such data? I thought about vectorizing the text using TfidfTransformer, but it produces a sparse matrix, which many learning algorithms (such as RandomForestRegressor) refuse to work with. Also, once I have the feature vector for the text, how do I combine it with the other features?
Any pointers on how to work with such data?
Thanks!
Upvotes: 6
Views: 1723
Reputation: 40159
I would first learn a linear model on the tf-idf features of each text field independently, then add each linear model's predictions as an additional feature alongside the other features, and train an ExtraTreesRegressor or GradientBoostingRegressor on the combined features.
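A minimal sketch of that approach, for a single text field, might look like the following. The toy data, the choice of Ridge as the linear model, and the use of TfidfVectorizer (which works directly on raw strings) are assumptions for illustration. Using out-of-fold predictions via cross_val_predict is also my addition, so that the tree model does not train on text predictions that have already seen the target.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for FullDescription, a numeric
# column, and the salary target.
descriptions = [
    "senior python developer for a trading firm",
    "junior nurse for a night shift ward",
    "java engineer with cloud experience",
    "care assistant for a residential home",
    "lead data scientist in a retail bank",
    "kitchen porter for a busy restaurant",
]
numeric = np.array([[3], [1], [3], [1], [2], [0]], dtype=float)
salary = np.array([60000, 24000, 52000, 18000, 70000, 15000], dtype=float)

# Step 1: linear model on the tf-idf features of the text field.
# The sparse matrix never leaves this pipeline.
text_model = make_pipeline(TfidfVectorizer(), Ridge(alpha=1.0))

# Out-of-fold predictions: each row is predicted by a model that
# did not see that row during fitting.
text_pred = cross_val_predict(text_model, descriptions, salary, cv=3)

# Step 2: append the text-based prediction as one dense extra column.
combined = np.column_stack([numeric, text_pred])

# Step 3: train the tree ensemble on the combined dense features.
forest = ExtraTreesRegressor(n_estimators=100, random_state=0)
forest.fit(combined, salary)

# At prediction time, refit the text model on all training data and
# build the same two-column layout for the new rows.
text_model.fit(descriptions, salary)
new_descriptions = ["senior python developer"]
new_numeric = np.array([[3]], dtype=float)
new_combined = np.column_stack(
    [new_numeric, text_model.predict(new_descriptions)]
)
prediction = forest.predict(new_combined)
```

With several text fields, the same steps would be repeated per field, giving one extra column each. Because only a single dense column per field reaches the tree model, the sparse-matrix restriction no longer applies.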
Upvotes: 5