Best way to build LabeledPoint of Features for Apache Spark MLlib in Java

Question

I am preparing data that contains ids (labels) and keywords (features) to pass to MLlib algorithms in Java. My keywords are comma-separated strings. My goal is to use multiclass classification algorithms to predict the id. The question is: how do I build the LabeledPoint vector?

I tried the transformation below, but I am getting low precision (30%). It is worth mentioning that when I use my own KNN classification code (plain Java), I get over 70% precision.

Feature transformation:

        Tokenizer tokenizer = new Tokenizer().setInputCol("keywords")
                .setOutputCol("words");

        DataFrame wordsData = tokenizer.transform(df);
        wordsData.show();
        int numFeatures = 35;
        HashingTF hashingTF = new HashingTF().setInputCol("words")
                .setOutputCol("rawFeatures").setNumFeatures(numFeatures);
        DataFrame featurizedData = hashingTF.transform(wordsData);
        //featurizedData.show();
        featurizedData.cache();
        IDF idf = new IDF().setInputCol("rawFeatures").setOutputCol(
                "features");
        IDFModel idfModel = idf.fit(featurizedData);
        DataFrame rescaledData = idfModel.transform(featurizedData);
        // Keep only the TF-IDF vector and the id, then map each Row to a LabeledPoint.
        JavaRDD<Row> rescaledRDD = rescaledData.select("features", "id")
                .toJavaRDD();
        JavaRDD<LabeledPoint> test = rescaledRDD
                .map(new MakeLabledPointRDD());

Is this the right way to convert an RDD row to a LabeledPoint with a sparse vector? Do I need to count the keywords and use CountVectorizer? If not, what is the best way to build it?

public static class MakeLabledPointRDD implements
        Function<Row, LabeledPoint> {

    @Override
    public LabeledPoint call(Row r) throws Exception {
        Vector features = r.getAs(0); // TF-IDF "features" column
        int id = r.getInt(1);         // "id" column, used as the class label
        return new LabeledPoint((double) id, features);
    }
}
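
For reference, here is a minimal plain-Java sketch (no Spark dependency) of what may be happening to a single keywords value in the pipeline above. It rests on two assumptions: the keywords contain commas but no spaces, and String.hashCode is only a stand-in for whatever hash HashingTF uses internally, so the bucket indices will not match Spark's. The class and method names are hypothetical. Tokenizer lowercases its input and splits on whitespace, so a string like "spark,mllib,java" would stay a single token; RegexTokenizer with setPattern(",") is one way to split on commas instead. With numFeatures = 35, distinct keywords can also land in the same bucket.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; not the Spark implementation.
class KeywordHashingSketch {

    // Mimics whitespace tokenization: lowercase, then split on whitespace.
    static String[] splitOnWhitespace(String keywords) {
        return keywords.toLowerCase().split("\\s");
    }

    // Splits on commas instead, dropping spaces around each keyword.
    static String[] splitOnCommas(String keywords) {
        return keywords.toLowerCase().trim().split("\\s*,\\s*");
    }

    // Hashing trick: maps each term to a bucket in [0, numFeatures) and counts
    // how many terms fall into each bucket. String.hashCode is an assumption
    // used for illustration, so indices differ from what HashingTF produces.
    static Map<Integer, Double> hashTerms(String[] terms, int numFeatures) {
        Map<Integer, Double> counts = new TreeMap<>();
        for (String term : terms) {
            int index = Math.floorMod(term.hashCode(), numFeatures);
            counts.merge(index, 1.0, Double::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String keywords = "spark,mllib,java";

        // One token: the whole string, because it contains no whitespace.
        System.out.println(Arrays.toString(splitOnWhitespace(keywords)));

        // Three tokens: spark, mllib, java.
        System.out.println(Arrays.toString(splitOnCommas(keywords)));

        // Sparse term counts with a small and a large feature space.
        System.out.println(hashTerms(splitOnCommas(keywords), 35));
        System.out.println(hashTerms(splitOnCommas(keywords), 1 << 18));
    }
}
```

If every row ends up as a single token, each document's vector has only one non-zero entry, which could explain part of the gap to the plain-Java KNN result. This is a guess to check against the output of wordsData.show(), not a confirmed diagnosis.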

