A.Dumas

Reputation: 3267

How to read text file and convert it to a Dataset in Java Spark?

I want to read in two text files with data and run some machine learning classification on the data in my Java Spark project.

Let fileZero and fileOne be two files containing the data in the following form:

>fileZero
10 1 
9.8 1.2 
10.1 0.9
....

And the other file

>fileOne
0.1 40 
0.2 38 
0 50
...

In fileZero and fileOne, each line contains an (x, y) tuple separated by a space. All rows in fileZero are supposed to be labeled 0, and all rows in fileOne labeled 1.

I want to read in both files and was thinking of using a Dataset object. How can I read in the two files so that I can later run classification/logistic regression on the data?

Upvotes: 2

Views: 9213

Answers (1)

kkurt

Reputation: 450

You can define a POJO class and read the files into objects of that class.

MyObject

public class MyObject implements java.io.Serializable {
private double x;
private double y;
private double label;
//Getters and setters
...
}

You can read and convert the files to a dataset like this:

JavaRDD<MyObject> cRDD = spark.read().textFile("C:/Temp/File0.csv").javaRDD()
                       .map(new Function<String, MyObject>() {
                              @Override
                              public MyObject call(String line) throws Exception {
                                     String[] parts = line.split(" ");
                                     MyObject c = new MyObject();
                                     // The setters take doubles, so the String
                                     // fields must be parsed first.
                                     c.setX(Double.parseDouble(parts[0].trim()));
                                     c.setY(Double.parseDouble(parts[1].trim()));
                                     c.setLabel(0);
                                     return c;
                              }
                       });


Dataset<Row> mainDataset = spark.createDataFrame(cRDD, MyObject.class);
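Since the question involves two files with different labels, the same mapping can be factored into a small helper and the two labeled RDDs combined with union. This is only a sketch: readLabeled is a hypothetical helper name, "C:/Temp/File1.csv" is an assumed path for fileOne, and an existing SparkSession named spark is assumed.

```java
// Hypothetical helper: read one space-separated file and attach the given label.
// Assumes the MyObject POJO defined above.
private static JavaRDD<MyObject> readLabeled(SparkSession spark, String path, double label) {
       return spark.read().textFile(path).javaRDD()
              .map(line -> {
                     String[] parts = line.split(" ");
                     MyObject c = new MyObject();
                     c.setX(Double.parseDouble(parts[0].trim()));
                     c.setY(Double.parseDouble(parts[1].trim()));
                     c.setLabel(label);
                     return c;
              });
}

// Label fileZero rows 0 and fileOne rows 1, then combine into one DataFrame.
JavaRDD<MyObject> zeroRDD = readLabeled(spark, "C:/Temp/File0.csv", 0);
JavaRDD<MyObject> oneRDD  = readLabeled(spark, "C:/Temp/File1.csv", 1);
Dataset<Row> combinedDataset = spark.createDataFrame(zeroRDD.union(oneRDD), MyObject.class);
```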

and then you can use classification methods...
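For example, here is a minimal sketch of running logistic regression with spark.ml, assuming a DataFrame like mainDataset above with the x, y, and label columns produced from MyObject:

```java
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.classification.LogisticRegressionModel;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// spark.ml estimators expect a single vector column of features,
// so first assemble the x and y columns into a "features" column.
VectorAssembler assembler = new VectorAssembler()
        .setInputCols(new String[]{"x", "y"})
        .setOutputCol("features");
Dataset<Row> training = assembler.transform(mainDataset);

// Fit a logistic regression model; "label" is the default label column name.
LogisticRegression lr = new LogisticRegression()
        .setMaxIter(10)
        .setRegParam(0.01);
LogisticRegressionModel model = lr.fit(training);
```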

Upvotes: 5
