Reputation: 719
I have a MapReduce job that I'm trying to migrate to PySpark. Is there any way of defining the name of the output file, rather than getting part-xxxxx?
In MR, I was using the org.apache.hadoop.mapred.lib.MultipleTextOutputFormat class to achieve this.
PS: I did try the saveAsTextFile() method. For example:
import re

lines = sc.textFile(filesToProcessStr)
lines.flatMap(lambda x: re.split(r'[\s&]', x.strip())) \
     .saveAsTextFile("/user/itsjeevs/mymr-output")
This still creates the same part-0000x files:
[13:46:25] [spark] $ hadoop fs -ls /user/itsjeevs/mymr-output/
Found 3 items
-rw-r----- 2 itsjeevs itsjeevs 0 2014-08-13 13:46 /user/itsjeevs/mymr-output/_SUCCESS
-rw-r--r-- 2 itsjeevs itsjeevs 101819636 2014-08-13 13:46 /user/itsjeevs/mymr-output/part-00000
-rw-r--r-- 2 itsjeevs itsjeevs 17682682 2014-08-13 13:46 /user/itsjeevs/mymr-output/part-00001
EDIT
I recently read an article which should make life much easier for Spark users.
Upvotes: 6
Views: 9067
Reputation: 1024
This is the closest I could get in Python to the accepted answer:
def saveAsSingleJsonFile(df, path, filename):
    # Convert the DataFrame to an RDD of JSON strings, then write it through
    # the new Hadoop output API so the output location carries the chosen name.
    rdd = df.toJSON()
    rdd.map(lambda x: (None, x)).saveAsNewAPIHadoopFile(
        f"{path}/{filename}",
        "org.apache.hadoop.mapreduce.lib.output.TextOutputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.apache.hadoop.io.Text"
    )
Upvotes: 0
Reputation: 3071
If you set mapreduce.output.basename in the Hadoop configuration, as in the Java snippet below, your output files will look like:
myfilename-r-00000 myfilename-r-00001
SparkConf sparkConf = new SparkConf().setAppName("WCSYNC-FileCompressor-ClusterSaver");
SparkContext sc = new SparkContext(sparkConf);
JavaSparkContext context = new JavaSparkContext(sc);

// Override the default "part" prefix used by the new Hadoop output API.
context.hadoopConfiguration().set("mapreduce.output.basename", "myfilename");

// Call saveAsNewAPIHadoopFile on your JavaPairRDD<Text, Text>
// (pairRDD below is a placeholder for whatever RDD you are saving).
pairRDD.saveAsNewAPIHadoopFile(outputpath,
        Text.class,
        Text.class,
        TextOutputFormat.class,
        context.hadoopConfiguration());
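For what it's worth, the same setting can be passed from PySpark through the conf argument of saveAsNewAPIHadoopFile. This is only a sketch assuming a pair RDD of (None, line) and the paths from the question; I have not verified it on every Spark version:
# Sketch only: the path and basename below are just examples.
pairs = sc.textFile(filesToProcessStr).map(lambda line: (None, line))
pairs.saveAsNewAPIHadoopFile(
    "/user/itsjeevs/mymr-output",
    "org.apache.hadoop.mapreduce.lib.output.TextOutputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.apache.hadoop.io.Text",
    conf={"mapreduce.output.basename": "myfilename"}  # files become myfilename-r-00000, ...
)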
Upvotes: -1
Reputation: 27455
Spark is also using Hadoop under the hood, so you can probably get what you want. This is how saveAsTextFile is implemented:
def saveAsTextFile(path: String) {
  this.map(x => (NullWritable.get(), new Text(x.toString)))
    .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path)
}
You could pass in a customized OutputFormat to saveAsHadoopFile. I have no idea how to do that from Python though. Sorry for the incomplete answer.
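If it helps, PySpark's RDD API does expose saveAsHadoopFile on pair RDDs, so a rough sketch from Python might look like this; the output format class named here (com.example.MyMultipleTextOutputFormat) is hypothetical and would have to be a MultipleTextOutputFormat subclass you provide on the executors' classpath:
# Hedged sketch: saveAsHadoopFile exists in PySpark, but the output format class
# named here (com.example.MyMultipleTextOutputFormat) is hypothetical and must be
# a MultipleTextOutputFormat subclass available on the JVM classpath.
pairs = lines.map(lambda x: (None, x))
pairs.saveAsHadoopFile(
    "/user/itsjeevs/mymr-output",
    "com.example.MyMultipleTextOutputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.apache.hadoop.io.Text"
)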
Upvotes: 4