Gabriela M

Reputation: 615

JavaRDD<String> to JavaRDD<Row>

I am reading a txt file as a JavaRDD with the following command:

JavaRDD<String> vertexRDD = ctx.textFile(pathVertex);

Now I would like to convert this to a JavaRDD<Row>, because in that txt file I have two columns of integers and want to add a schema to the rows after splitting the columns.

I tried also this:

JavaRDD<Row> rows = vertexRDD.map(line -> line.split("\t"))

But it says I cannot assign the map function to an "Object" RDD.

  1. How can I create a JavaRDD<Row> out of a JavaRDD<String>?
  2. How can I use map on the JavaRDD?

Thanks!

Upvotes: 2

Views: 3391

Answers (2)

nxtabb

Reputation: 11

You can define these two columns as fields of a class, and then use:

JavaRDD<Row> rows = rdd.map(new Function<ClassName, Row>() {
            @Override
            public Row call(ClassName target) throws Exception {
                return RowFactory.create(
                        target.getField1(),
                        target.getField2());
            }
        });

Then create the StructFields and finally build the DataFrame:

StructType struct = DataTypes.createStructType(fields);
Dataset<Row> dataFrame = sparkSession.createDataFrame(rows, struct);
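For two integer columns, the `fields` list above could be built like this (the column names "src" and "dst" are assumptions, not from the question):

```java
import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Hypothetical column names for the two integer columns.
List<StructField> fields = Arrays.asList(
        DataTypes.createStructField("src", DataTypes.IntegerType, false),
        DataTypes.createStructField("dst", DataTypes.IntegerType, false));
StructType struct = DataTypes.createStructType(fields);
```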

Upvotes: 1

Oli

Reputation: 10406

Creating a JavaRDD out of another is implicit when you apply a transformation such as map. Here, the RDD you create is an RDD of arrays of strings (the result of split).

To get an RDD of rows, just create a Row from each array:

JavaRDD<String> vertexRDD = ctx.textFile(pathVertex);
JavaRDD<String[]> rddOfArrays = vertexRDD.map(line -> line.split("\t"));
JavaRDD<Row> rddOfRows = rddOfArrays.map(fields -> RowFactory.create(fields));
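Note that split gives you strings, while the file holds integers; if you want the Row fields typed as int, you can parse before building the Row. A minimal sketch (the helper name and class are ours, not from the answer):

```java
public class ParseExample {
    // Parse one tab-separated line of two integers into an int[].
    static int[] parseLine(String line) {
        String[] parts = line.split("\t");
        return new int[]{ Integer.parseInt(parts[0]), Integer.parseInt(parts[1]) };
    }

    public static void main(String[] args) {
        int[] v = parseLine("3\t7");
        System.out.println(v[0] + "," + v[1]);  // prints 3,7
    }
}
```

With Spark this would plug in as `vertexRDD.map(l -> { int[] v = parseLine(l); return RowFactory.create(v[0], v[1]); })`.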

Note that if your goal is then to transform the JavaRDD<Row> to a dataframe (Dataset<Row>), there is a simpler way. You can change the delimiter option when using spark.read to avoid having to use RDDs:

Dataset<Row> dataframe = spark.read()
    .option("delimiter", "\t")
    .csv("your_path/file.csv");  
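By default csv() reads every column as a string; if you want the two columns typed as integers, Spark's reader can infer that. A sketch of the same read with inferSchema added (the path is still a placeholder):

```java
Dataset<Row> dataframe = spark.read()
    .option("delimiter", "\t")
    .option("inferSchema", "true")
    .csv("your_path/file.csv");
dataframe.printSchema();  // the two columns should come back as int
```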

Upvotes: 1
