Sparkan

Reputation: 159

Best way to build LabeledPoint of Features for Apache Spark MLlib in Java

I am preparing data that contains IDs (labels) and keywords (features) to pass to MLlib algorithms in Java. My keywords are comma-separated strings. My goal is to use multiclass classification algorithms to predict the ID. The question is: how do I build the LabeledPoint vector?

I tried the transformation below, but I am getting low precision (30%). It is worth mentioning that when I use my own KNN classification code (plain Java) I get over 70% precision.

Feature transformation:

        Tokenizer tokenizer = new Tokenizer().setInputCol("keywords")
                .setOutputCol("words");

        DataFrame wordsData = tokenizer.transform(df);
        wordsData.show();
        int numFeatures = 35;
        HashingTF hashingTF = new HashingTF().setInputCol("words")
                .setOutputCol("rawFeatures").setNumFeatures(numFeatures);
        DataFrame featurizedData = hashingTF.transform(wordsData);
        //featurizedData.show();
        featurizedData.cache();
        IDF idf = new IDF().setInputCol("rawFeatures").setOutputCol(
                "features");
        IDFModel idfModel = idf.fit(featurizedData);
        DataFrame rescaledData = idfModel.transform(featurizedData);
        JavaRDD<Row> rescaledRDD = rescaledData.select("features", "id")
                .toJavaRDD();
        JavaRDD<LabeledPoint> test = rescaledRDD
                .map(new MakeLabledPointRDD());

Is this the right way to convert an RDD row to a LabeledPoint with a sparse vector? Do I need to count the keywords and use CountVectorizer? If not, what is the best way to build it?

public static class MakeLabledPointRDD implements
        Function<Row, LabeledPoint> {

    @Override
    public LabeledPoint call(Row r) throws Exception {
        Vector features = r.getAs(0); //keywords in RDD
        Integer str = r.getInt(1); //id in RDD
        Double label = (double) str;
        LabeledPoint lp = new LabeledPoint(label, features);
        return lp;
    }
}

Upvotes: 1

Views: 653

Answers (1)

AHH

Reputation: 1083

Your MakeLabledPointRDD seems to be correct. However, the TF-IDF transformation seems to be a local one that works at the row level. This means that the weights you are getting are actually computed for each individual instance (row) of an ID.

All you need to do is group the rows by ID before creating the TF-IDF vectors, i.e. your df variable should contain only one row per ID.
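
To illustrate the grouping step, here is a minimal plain-Java sketch that uses ordinary collections as a stand-in for the DataFrame. The class name GroupKeywordsById, the helper groupById, and the sample rows are all hypothetical and not taken from the question. In Spark you would presumably express the same idea as a group-by on the id column with an aggregation that concatenates the keywords, and the exact call depends on your Spark version.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupKeywordsById {

    // Hypothetical helper: merges the comma-separated keyword strings of all
    // rows that share an id, so that each id ends up with exactly one entry.
    public static Map<Integer, String> groupById(List<Map.Entry<Integer, String>> rows) {
        // LinkedHashMap keeps the ids in the order they were first seen.
        Map<Integer, String> grouped = new LinkedHashMap<>();
        for (Map.Entry<Integer, String> row : rows) {
            // merge() stores the keywords for a new id, or appends them
            // (comma-separated) to the keywords already stored for that id.
            grouped.merge(row.getKey(), row.getValue(), (a, b) -> a + "," + b);
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Made-up sample data: id 1 appears in two separate rows.
        List<Map.Entry<Integer, String>> rows = new ArrayList<>();
        rows.add(new SimpleEntry<>(1, "spark,mllib"));
        rows.add(new SimpleEntry<>(2, "knn,java"));
        rows.add(new SimpleEntry<>(1, "tfidf"));

        // After grouping there is a single line of keywords per id.
        for (Map.Entry<Integer, String> e : groupById(rows).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Each grouped entry would then play the role of one row in df, so the tokenizer and the TF-IDF stages see all keywords of an ID together.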

Upvotes: 0
