Reputation: 159
I am preparing data that contains IDs (labels) and keywords (features) to pass to MLlib algorithms in Java. My keywords are strings separated by commas. My goal is to use multiclass classification algorithms to predict the ID. The question is: how do I build the LabeledPoint vector?
I tried the transformation below, but I am getting low precision (30%). It is worth mentioning that when I use my own KNN classification code (plain Java), I get over 70% precision.
Feature transformation:
// Tokenizer lowercases the text and splits it on whitespace
Tokenizer tokenizer = new Tokenizer()
        .setInputCol("keywords")
        .setOutputCol("words");
DataFrame wordsData = tokenizer.transform(df);
wordsData.show();
// Hash the tokens into a fixed-size term-frequency vector
int numFeatures = 35;
HashingTF hashingTF = new HashingTF()
        .setInputCol("words")
        .setOutputCol("rawFeatures")
        .setNumFeatures(numFeatures);
DataFrame featurizedData = hashingTF.transform(wordsData);
// featurizedData.show();
featurizedData.cache();
// Rescale the term frequencies by inverse document frequency
IDF idf = new IDF()
        .setInputCol("rawFeatures")
        .setOutputCol("features");
IDFModel idfModel = idf.fit(featurizedData);
DataFrame rescaledData = idfModel.transform(featurizedData);
// Convert each (features, id) row into a LabeledPoint
JavaRDD<Row> rescaledRDD = rescaledData.select("features", "id").toJavaRDD();
JavaRDD<LabeledPoint> test = rescaledRDD.map(new MakeLabledPointRDD());
Is this the right way to convert an RDD row to a LabeledPoint with a sparse vector? Do I need to count the keywords and use CountVectorizer? If not, what is the best way to build it?
public static class MakeLabledPointRDD implements Function<Row, LabeledPoint> {
    @Override
    public LabeledPoint call(Row r) throws Exception {
        Vector features = r.getAs(0);  // TF-IDF vector from the "features" column
        double label = r.getInt(1);    // "id" column, widened to double for the label
        return new LabeledPoint(label, features);
    }
}
Upvotes: 1
Views: 653
Reputation: 1083
Your MakeLabledPointRDD seems to be correct. However, the TF-IDF transformation seems to be a local one that works at row level. This means that the weights you are getting describe each individual row (instance) of an ID rather than the ID as a whole.
All you need to do is group the rows by ID before creating the TF-IDF vectors, i.e. your df variable should contain only one row per ID.
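To illustrate the grouping step, here is a minimal plain-Java sketch, not Spark code. It assumes the rows are available as simple (id, keywords) pairs; the names KeywordRow and groupById are hypothetical and only serve the example. It merges all comma-separated keywords that share an ID into one list, so each ID ends up as a single "document". In Spark you would do the equivalent aggregation on df before running the feature pipeline.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupKeywordsById {

    /** Hypothetical stand-in for one row of df: an id and its comma-separated keywords. */
    public static final class KeywordRow {
        final int id;
        final String keywords;

        public KeywordRow(int id, String keywords) {
            this.id = id;
            this.keywords = keywords;
        }
    }

    /** Merges all rows that share an id into one keyword list per id. */
    public static Map<Integer, List<String>> groupById(List<KeywordRow> rows) {
        // LinkedHashMap keeps ids in the order they first appear
        Map<Integer, List<String>> grouped = new LinkedHashMap<>();
        for (KeywordRow row : rows) {
            List<String> words = grouped.computeIfAbsent(row.id, k -> new ArrayList<>());
            // Split on commas (the separator described in the question) and normalize
            for (String keyword : row.keywords.split(",")) {
                String trimmed = keyword.trim().toLowerCase();
                if (!trimmed.isEmpty()) {
                    words.add(trimmed);
                }
            }
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<KeywordRow> rows = new ArrayList<>();
        rows.add(new KeywordRow(1, "spark,mllib"));
        rows.add(new KeywordRow(2, "java, knn"));
        rows.add(new KeywordRow(1, "tfidf"));
        // One entry per id, with the keywords of all its rows combined
        System.out.println(groupById(rows));
    }
}
```

Each map entry then corresponds to the single row per ID that df should contain before TF-IDF is computed.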
Upvotes: 0