Reputation: 8957
I am working on a batch application using Apache Spark, and I want to write the final RDD as a text file. Currently I am using the saveAsTextFile("filePath")
method available on RDD.
My text file contains fields delimited with the \u0001
delimiter, so in the model class's toString() method I appended all the fields separated by the \u0001
delimiter.
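For illustration, a minimal sketch of such a toString() (the class and field names here are hypothetical, not from my actual code):

```java
// Hypothetical model class whose toString() joins its fields
// with the \u0001 delimiter, matching the file format described above.
public class Record {
    private final String id;
    private final String name;

    public Record(String id, String name) {
        this.id = id;
        this.name = name;
    }

    @Override
    public String toString() {
        // String.join concatenates the fields with the \u0001 separator
        return String.join("\u0001", id, name);
    }
}
```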
Is this the correct way to handle this, or is there a better approach?
Also, what if I iterate over the RDD and write the file contents using the FileWriter
class available in Java?
Please advise on this.
Regards, Shankar
Upvotes: 1
Views: 3939
Reputation: 74
Instead of calling collect and pulling everything to the driver, I would suggest using coalesce, which helps avoid driver memory problems.
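A minimal sketch of that approach (the sample data and output path are assumptions for illustration; local master is used only so the sketch is self-contained):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;

public class CoalesceWrite {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("coalesce-write").setMaster("local[2]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> rdd = sc.parallelize(Arrays.asList("a\u0001b", "c\u0001d"));
            // coalesce(1) reduces the RDD to a single partition, so
            // saveAsTextFile produces one part file instead of many,
            // without collecting the data onto the driver.
            rdd.coalesce(1).saveAsTextFile("out/single-part");
        }
    }
}
```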
Upvotes: 0
Reputation: 704
public static boolean copyMerge(JavaSparkContext sc, JavaRDD<String> rdd, String dstPath) throws IOException, URISyntaxException {
    // hadoopConfiguration() lives on the context, not on SparkConf
    Configuration hadoopConf = sc.hadoopConfiguration();
    hadoopConf.set("fs.s3.awsAccessKeyId", awsAccessKey);
    hadoopConf.set("fs.s3.awsSecretAccessKey", awsSecretKey);
    String tempFolder = "s3://bucket/folder";
    rdd.saveAsTextFile(tempFolder);
    FileSystem hdfs = FileSystem.get(new URI(tempFolder), hadoopConf);
    return FileUtil.copyMerge(hdfs, new Path(tempFolder), hdfs, new Path(dstPath), false, hadoopConf, null);
}
This solution works for S3 or any HDFS-compatible file system, in two steps:
Save the RDD with saveAsTextFile; this generates multiple part files in the folder.
Run Hadoop's copyMerge to merge them into a single file at dstPath.
Upvotes: 0
Reputation: 7452
To write as a single file there are a few options. If you're writing to HDFS or a similar distributed store, you can first coalesce
your RDD down to a single partition (note: your data must then fit on a single worker), or you can collect
the data to the driver and then use a FileWriter.
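A minimal sketch of the collect-then-write option (the sample data and output path are assumptions; local master is used only so the sketch is self-contained):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

import java.io.FileWriter;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class CollectWrite {
    public static void main(String[] args) throws IOException {
        SparkConf conf = new SparkConf().setAppName("collect-write").setMaster("local[2]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // collect() pulls every record to the driver, so the whole
            // data set must fit in driver memory.
            List<String> lines = sc.parallelize(Arrays.asList("a\u0001b", "c\u0001d")).collect();
            try (FileWriter writer = new FileWriter("output.txt")) {
                for (String line : lines) {
                    writer.write(line);
                    writer.write(System.lineSeparator());
                }
            }
        }
    }
}
```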
Upvotes: 4