Shankar

Reputation: 8957

saveAsTextFile() to write the final RDD as single text file - Apache Spark

I am working on a batch application using Apache Spark, and I want to write the final RDD as a text file. Currently I am using the saveAsTextFile("filePath") method available on RDD.

The fields in my text file are delimited with \u0001, so in the model class's toString() method I join all the fields with the \u0001 delimiter.
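For illustration, here is a minimal sketch of that kind of model class. The `Employee` class and its fields are hypothetical names invented for the example, and the demo runs without Spark. It only shows that joining fields in toString() yields one delimited line per record.

```java
import java.util.Arrays;

// Hypothetical model class; the name and fields are assumptions for illustration.
final class Employee {
    // Ctrl-A (U+0001) field delimiter, as described in the question.
    static final String DELIMITER = "\u0001";

    private final int id;
    private final String name;
    private final String city;

    Employee(int id, String name, String city) {
        this.id = id;
        this.name = name;
        this.city = city;
    }

    // One output line per record: the fields joined by the delimiter.
    @Override
    public String toString() {
        return String.join(DELIMITER, String.valueOf(id), name, city);
    }
}

class DelimitedRecordDemo {
    public static void main(String[] args) {
        Employee e = new Employee(7, "Asha", "Pune");
        // Splitting on the delimiter recovers the three fields.
        String[] fields = e.toString().split(Employee.DELIMITER);
        System.out.println(fields.length + " fields: " + Arrays.toString(fields));
        // prints: 3 fields: [7, Asha, Pune]
    }
}
```

saveAsTextFile writes the string representation of each element, so a toString() like this determines the line format. One thing to watch: a field value that itself contains the delimiter or a newline would corrupt the record.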

Is this the correct way to handle this, or is there a better approach?

Also, what if I iterate over the RDD and write the file content using the FileWriter class available in Java?

Please advise.

Regards, Shankar

Upvotes: 1

Views: 3939

Answers (3)

Sri Harsha

Reputation: 74

Instead of calling collect() and pulling all the data to the driver, I would suggest using coalesce, which helps avoid memory problems on the driver.

Upvotes: 0

meeza

Reputation: 704

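// Needs: java.io.IOException, java.net.URI, java.net.URISyntaxException,
// org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.FileSystem/FileUtil/Path,
// org.apache.spark.api.java.JavaRDD/JavaSparkContext.
// Note: FileUtil.copyMerge exists in Hadoop 2.x; it was removed in Hadoop 3.0.
public static boolean copyMerge(JavaSparkContext sc, JavaRDD<?> rdd, String dstPath) throws IOException, URISyntaxException {
    // The Hadoop configuration lives on the Spark context, not on SparkConf.
    Configuration hadoopConf = sc.hadoopConfiguration();
    // awsAccessKey and awsSecretKey are assumed to be defined elsewhere.
    hadoopConf.set("fs.s3.awsAccessKeyId", awsAccessKey);
    hadoopConf.set("fs.s3.awsSecretAccessKey", awsSecretKey);
    String tempFolder = "s3://bucket/folder";
    // Step 1: write the RDD as multiple part files into the temporary folder.
    rdd.saveAsTextFile(tempFolder);
    FileSystem hdfs = FileSystem.get(new URI(tempFolder), hadoopConf);
    // Step 2: merge the part files into a single file at dstPath.
    return FileUtil.copyMerge(hdfs, new Path(tempFolder), hdfs, new Path(dstPath), false, hadoopConf, null);
}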

This solution works for S3 or any HDFS-compatible file system. It takes two steps:

  1. Save the RDD with saveAsTextFile, which generates multiple part files in the folder.

  2. Run Hadoop's FileUtil.copyMerge to merge them into a single file.
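To make the merge step concrete, here is a rough local-filesystem analogue written with plain java.nio. It is not the Hadoop API, and the `LocalPartMerger` class and `mergeParts` method are hypothetical names. It assumes the part files are named part-* and should be concatenated in name order.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class LocalPartMerger {
    // Concatenates every part-* file in srcDir, in name order, into dst.
    static void mergeParts(Path srcDir, Path dst) throws IOException {
        List<Path> parts;
        try (Stream<Path> files = Files.list(srcDir)) {
            parts = files
                .filter(p -> p.getFileName().toString().startsWith("part-"))
                .sorted()
                .collect(Collectors.toList());
        }
        try (OutputStream out = Files.newOutputStream(dst)) {
            for (Path part : parts) {
                Files.copy(part, out); // append this part's bytes to the merged file
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulate the folder that step 1 would produce.
        Path dir = Files.createTempDirectory("rdd-parts");
        Files.write(dir.resolve("part-00000"), "a\n".getBytes());
        Files.write(dir.resolve("part-00001"), "b\n".getBytes());
        Path merged = Files.createTempFile("merged", ".txt");
        mergeParts(dir, merged);
        System.out.println(Files.readAllLines(merged)); // prints: [a, b]
    }
}
```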

Upvotes: 0

Holden

Reputation: 7452

To write a single file there are a few options. If you're writing to HDFS or a similar distributed store, you can first coalesce your RDD down to a single partition (note that your data must then fit on a single worker), or you can collect the data to the driver and then use a FileWriter.
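As a sketch of the second option, the snippet below writes a list of lines to one local file with FileWriter. The `SingleFileWriter` class and `writeLines` method are hypothetical names, and the hard-coded list stands in for the result of `rdd.collect()`, so the example runs without Spark. This only makes sense when the collected data fits in the driver's memory.

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

class SingleFileWriter {
    // Writes one record per line to a single local file on the driver.
    static void writeLines(List<String> lines, String path) throws IOException {
        // try-with-resources flushes and closes the writer even on failure.
        try (BufferedWriter out = new BufferedWriter(new FileWriter(path))) {
            for (String line : lines) {
                out.write(line);
                out.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for rdd.collect(); in a Spark job this list would come from the RDD.
        List<String> collected = Arrays.asList("1\u0001alpha", "2\u0001beta");
        Path target = Files.createTempFile("rdd-output", ".txt");
        writeLines(collected, target.toString());
        System.out.println("Wrote " + Files.readAllLines(target).size() + " lines to " + target);
    }
}
```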

Upvotes: 4
