Reputation: 615
I am reading a txt file as a JavaRDD with the following command:
JavaRDD<String> vertexRDD = ctx.textFile(pathVertex);
Now, I would like to convert this to a JavaRDD<Row>, because the txt file has two columns of Integers and I want to add a schema to the rows after splitting the columns.
I also tried this:
JavaRDD<Row> rows = vertexRDD.map(line -> line.split("\t"));
But it says I cannot assign the map function to an "Object" RDD
Thanks!
Upvotes: 2
Views: 3391
Reputation: 11
You can define these two columns as fields of a class, and then use:
JavaRDD<Row> rows = rdd.map(new Function<ClassName, Row>() {
    @Override
    public Row call(ClassName target) throws Exception {
        // Build one Row per object from its two fields
        return RowFactory.create(
            target.getField1(),
            target.getUsername());
    }
});
Then create the StructField entries that describe the columns, and finally build the DataFrame using:
StructType struct = DataTypes.createStructType(fields);
Dataset<Row> dataFrame = sparkSession.createDataFrame(rows, struct);
Upvotes: 1
Reputation: 10406
Creating a JavaRDD out of another is implicit when you apply a transformation such as map. Here, the RDD you create is an RDD of arrays of strings (the result of split).
To get an RDD of rows, just create a Row from the array:
JavaRDD<String> vertexRDD = ctx.textFile("");
JavaRDD<String[]> rddOfArrays = vertexRDD.map(line -> line.split("\t"));
JavaRDD<Row> rddOfRows = rddOfArrays.map(fields -> RowFactory.create(fields));
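The rows built this way hold strings, while the question mentions two columns of Integers. If integer values are needed in the Row, the parsing could be done in a small helper before calling RowFactory.create. The following is only a sketch in plain Java: the class name VertexLineParser and its parse method are hypothetical, and it assumes every line contains exactly two tab-separated integers.

```java
// Hypothetical helper (not part of Spark): turns one line of the text file
// into the values for a two-column Row.
final class VertexLineParser {
    private VertexLineParser() {}

    // Assumes exactly two tab-separated integer columns per line.
    // Throws IllegalArgumentException on a wrong column count and
    // NumberFormatException (a subclass of it) on a non-integer value.
    static Object[] parse(String line) {
        String[] parts = line.split("\t");
        if (parts.length != 2) {
            throw new IllegalArgumentException(
                "Expected 2 tab-separated columns but got " + parts.length + ": " + line);
        }
        return new Object[] {
            Integer.valueOf(parts[0].trim()),
            Integer.valueOf(parts[1].trim())
        };
    }
}
```

Under those assumptions it could be plugged into the map as vertexRDD.map(line -> RowFactory.create(VertexLineParser.parse(line))), since RowFactory.create accepts the values as varargs.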
Note that if your goal is then to transform the JavaRDD<Row> to a dataframe (Dataset<Row>), there is a simpler way. You can change the delimiter option when using spark.read to avoid having to use RDDs:
Dataset<Row> dataframe = spark.read()
.option("delimiter", "\t")
.csv("your_path/file.csv");
Upvotes: 1