Reputation: 3267
I want to read in two text files with data and run some machine learning classification on the data in my Java Spark project
Let fileZero and fileOne be two files containing the data in the following form:
>fileZero
10 1
9.8 1.2
10.1 0.9
....
And the other file
>fileOne
0.1 40
0.2 38
0 50
...
For fileZero and fileOne, each line contains an (x, y) tuple separated by a space, labeled 0 and 1 respectively. In other words, all rows in fileZero are supposed to be labeled 0, and all rows in fileOne labeled 1.
I want to read in both files and was thinking of using the Dataset object.
How can I read in the two files so that later I can run classification/logistic regression on the data?
Upvotes: 2
Views: 9213
Reputation: 450
You can define a POJO class and read the files into objects of that class.
MyObject
public class MyObject {
    private double x;
    private double y;
    private double label;
    // Getters and setters
    ...
}
You can read the files and convert them to a Dataset like this:
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

JavaRDD<MyObject> cRDD = spark.read().textFile("C:/Temp/File0.csv").javaRDD()
        .map(new Function<String, MyObject>() {
            @Override
            public MyObject call(String line) throws Exception {
                // Each line is "x y"; parse both values as doubles
                String[] parts = line.split(" ");
                MyObject c = new MyObject();
                c.setX(Double.parseDouble(parts[0].trim()));
                c.setY(Double.parseDouble(parts[1].trim()));
                c.setLabel(0); // every row in fileZero gets label 0
                return c;
            }
        });
Dataset<Row> mainDataset = spark.createDataFrame(cRDD, MyObject.class);
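Since the same parsing runs for both files, it may be cleaner to factor it into a small helper that takes the label as a parameter. A minimal sketch (LineParser and parseLine are illustrative names, not part of any Spark API; it returns a plain array just to keep the example self-contained):

```java
public class LineParser {
    // Parses a line like "9.8 1.2" into {x, y, label}.
    // Illustrative helper, not Spark API; split on runs of whitespace
    // so extra spaces in the input don't break parsing.
    public static double[] parseLine(String line, double label) {
        String[] parts = line.trim().split("\\s+");
        return new double[] {
            Double.parseDouble(parts[0]),
            Double.parseDouble(parts[1]),
            label
        };
    }
}
```

Inside the map function you would then call this helper with 0 for the first file and 1 for the second, instead of duplicating the parsing code.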
Do the same for the second file with setLabel(1), union the two Datasets, and then you can use classification methods on the combined data.
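The end-to-end flow might look roughly like this (a sketch, not a complete program: rddZero and rddOne are assumed to be the two JavaRDDs built as above with labels 0 and 1, and the column names follow the MyObject fields; it uses spark.ml's VectorAssembler and LogisticRegression, which expect the features in a single vector column):

```java
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.classification.LogisticRegressionModel;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Read each file with its own label, then combine the two DataFrames.
Dataset<Row> zeros = spark.createDataFrame(rddZero, MyObject.class); // label 0
Dataset<Row> ones  = spark.createDataFrame(rddOne,  MyObject.class); // label 1
Dataset<Row> all   = zeros.union(ones);

// spark.ml classifiers expect a single vector column of features.
VectorAssembler assembler = new VectorAssembler()
        .setInputCols(new String[] { "x", "y" })
        .setOutputCol("features");
Dataset<Row> training = assembler.transform(all);

LogisticRegression lr = new LogisticRegression()
        .setLabelCol("label")
        .setFeaturesCol("features");
LogisticRegressionModel model = lr.fit(training);
```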
Upvotes: 5