Reputation: 137
I am new to Apache Spark and was trying to convert data from a .csv file into LabeledPoint so I can use the MLlib package of Apache Spark. I tried the following code to get an RDD of LabeledPoint data, but it turns out it produces the LabeledPoint of the ML package. Now I want to create the correct LabeledPoint of the MLlib package. Could anyone please help?
private static String appName = "learning_RDD";
private static String master = "spark://23.195.26.187:7077" ;
static SparkConf sparkConf = new SparkConf().setMaster("local[1]").setAppName("MLPipelineSample").set("spark.driver.memory", "512m").set("spark.sql.warehouse.dir","D:\\input.txt");
static SparkContext sc = new SparkContext(sparkConf);
static SparkSession spark = SparkSession
.builder().sparkContext(sc)
.getOrCreate();
public static void main(String args[]) throws IOException {
Dataset<Row> trainingData = spark.read().format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("D:\\abc\\Spark\\WebcontentClassification_UsingSparkML\\WebcontentClassification_UsingSparkML\\NaiveBayes_ML_20ErrorRate\\nutchcsvalldata.csv");
Tokenizer tokenizer = new Tokenizer().setInputCol("content").setOutputCol("words");
Dataset<Row> words = tokenizer.transform(trainingData);
StopWordsRemover remover = new StopWordsRemover().setInputCol("words").setOutputCol("filteredwords");
Dataset<Row> filteredwords = remover.transform(words);
HashingTF hashingTF = new HashingTF().setNumFeatures(1000).setInputCol("filteredwords").setOutputCol("rawfeatures");
Dataset<Row> hashedtf_Vector = hashingTF.transform(filteredwords);
IDF idf = new IDF().setInputCol("rawfeatures").setOutputCol("features");
IDFModel idfModel = idf.fit(hashedtf_Vector);
Dataset<Row> Vectors = idfModel.transform(hashedtf_Vector);
Iterator<Row> iterator = Vectors.toLocalIterator();
List<LabeledPoint> labeledpoints = new ArrayList<LabeledPoint>();
while(iterator.hasNext())
{
Row r = iterator.next();
int label = r.getAs(2);
Vector v = r.getAs(16);
LabeledPoint labeledpoint = new LabeledPoint(label, v);
labeledpoints.add(labeledpoint);
}
// Here I am supposed to convert the List into RDD<LabeledPoint> and use the SVM algorithm
}
Upvotes: 1
Views: 2932
Reputation: 137
I found a solution (I am not posting the exact solution). To start with converting words into vectors, the following code could be used; the result can later be turned into LabeledPoint.
JavaRDD<String> lines = spark.read().textFile(Input_file_path).toJavaRDD();
JavaRDD<Iterable<String>> words_iterable = lines.map(new Function<String, Iterable<String>>() {
public Iterable<String> call(String s) throws Exception {
String[] words = s.split(" ");
Iterable<String> output = Arrays.asList(words);
return output;
}
});
Word2Vec word2vec = new Word2Vec();
Word2VecModel word2vecmodel = word2vec.fit(words_iterable);
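Once fitted, the MLlib Word2VecModel already returns MLlib vectors, so they can go straight into a LabeledPoint without any ML-to-MLlib conversion. A minimal sketch of the lookup; the word "spark" and the label 1.0 are just placeholders:

```java
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.regression.LabeledPoint;

// Look up the learned vector for a word (throws if the word was not in the training corpus)
Vector v = word2vecmodel.transform("spark");
// The mllib Word2Vec yields mllib vectors, so they plug directly into LabeledPoint
LabeledPoint point = new LabeledPoint(1.0, v);
```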
Upvotes: 0
Reputation: 191738
I want to create the correct LabeledPoint Data of MLlib Package
import org.apache.spark.mllib.regression.LabeledPoint;
I am supposed to convert the List into RDD and use
I think you need to map over Vectors and convert the result into the format you need.
I've been using Scala, but it might translate roughly like this
RDD<LabeledPoint> training = Vectors.map(r -> {
double label = (double) r.getAs(2); // labels should be doubles
Vector v = r.getAs(16); // maybe convert this to a dense / sparse array
return new LabeledPoint(label, v);
});
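In Java, a rough equivalent goes through toJavaRDD(). One extra step matters here: the ML-pipeline IDF in the question produces org.apache.spark.ml.linalg.Vector, while the MLlib LabeledPoint needs org.apache.spark.mllib.linalg.Vector, and Vectors.fromML(...) bridges the two. Because the question names its Dataset<Row> "Vectors", which clashes with the MLlib Vectors class, the sketch below renames it tfidfData (an assumed name):

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;

// tfidfData is the Dataset<Row> returned by idfModel.transform(...)
// (called "Vectors" in the question; renamed to avoid clashing with mllib's Vectors class)
JavaRDD<LabeledPoint> training = tfidfData.toJavaRDD().map(r -> {
    double label = ((Number) r.get(2)).doubleValue();        // labels should be doubles
    org.apache.spark.ml.linalg.Vector mlVec = r.getAs(16);   // ML-package vector from IDF
    return new LabeledPoint(label, Vectors.fromML(mlVec));   // bridge ML -> MLlib vector
});
```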
Upvotes: 1
Reputation: 56
Let's say you have four fields in each row of your csv file, of which the first field is your label and the remaining three fields are your features (assuming all are double values). You can create your LabeledPoint RDD as follows:
JavaSparkContext sc = new JavaSparkContext(sparkConf);
String path = "path/to/your/data.csv"; // a path to your csv file, not the data source name
JavaRDD<String> data = sc.textFile(path);
JavaRDD<LabeledPoint> parsedData = data
.map(new Function<String, LabeledPoint>() {
public LabeledPoint call(String line) throws Exception {
String[] parts = line.split(",");
return new LabeledPoint(Double.parseDouble(parts[0]),
Vectors.dense(Double.parseDouble(parts[1]),
Double.parseDouble(parts[2]),
Double.parseDouble(parts[3])));
}
});
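Since the question ultimately wants to run SVM, the resulting RDD can be fed to MLlib's SVMWithSGD. A minimal sketch continuing from parsedData above; the iteration count of 100 is an assumed setting to tune for your data:

```java
import org.apache.spark.mllib.classification.SVMModel;
import org.apache.spark.mllib.classification.SVMWithSGD;
import org.apache.spark.mllib.linalg.Vectors;

parsedData.cache();      // SVMWithSGD makes multiple passes over the data, so cache it
int numIterations = 100; // assumed; tune for your data
SVMModel model = SVMWithSGD.train(parsedData.rdd(), numIterations);
// Predict the label of a new point with the same three features
double prediction = model.predict(Vectors.dense(1.0, 2.0, 3.0));
```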
Upvotes: 3