Jeevs

Reputation: 719

Specifying the output file name in Apache Spark

I have a MapReduce job that I'm trying to migrate to PySpark. Is there any way of defining the name of the output file, rather than getting part-xxxxx?

In MR, I was using the org.apache.hadoop.mapred.lib.MultipleTextOutputFormat class to achieve this.

PS: I did try the saveAsTextFile() method. For example:

import re

lines = sc.textFile(filesToProcessStr)
lines.flatMap(lambda x: re.split(r'[\s&]', x.strip())) \
     .saveAsTextFile("/user/itsjeevs/mymr-output")

This still creates the same part-00000 files.

[13:46:25] [spark] $ hadoop fs -ls /user/itsjeevs/mymr-output/
Found 3 items
-rw-r-----   2 itsjeevs itsjeevs          0 2014-08-13 13:46 /user/itsjeevs/mymr-output/_SUCCESS
-rw-r--r--   2 itsjeevs itsjeevs  101819636 2014-08-13 13:46 /user/itsjeevs/mymr-output/part-00000
-rw-r--r--   2 itsjeevs itsjeevs   17682682 2014-08-13 13:46 /user/itsjeevs/mymr-output/part-00001

EDIT

I recently read an article that would make life much easier for Spark users.

Upvotes: 6

Views: 9067

Answers (3)

ciurlaro

Reputation: 1024

This is the closest I could get in Python to the accepted answer:

def saveAsSingleJsonFile(df, path, filename):
    # Coalesce to one partition so the output holds a single part file.
    rdd = df.coalesce(1).toJSON()
    # Note: `filename` becomes a directory; the data lands in a part file inside it.
    rdd.map(lambda x: (None, x)).saveAsNewAPIHadoopFile(
        f"{path}/{filename}",
        "org.apache.hadoop.mapreduce.lib.output.TextOutputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.apache.hadoop.io.Text"
    )
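
For example, given a DataFrame df (the path and name here are hypothetical):

# Writes the JSON lines to /tmp/exports/users.json/part-r-00000
saveAsSingleJsonFile(df, "/tmp/exports", "users.json")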

Upvotes: 0

Neethu Lalitha

Reputation: 3071

Set mapreduce.output.basename on the Hadoop configuration to change the part-file prefix. Your output files will then look like:

myfilename-r-00000
myfilename-r-00001

SparkConf sparkConf = new SparkConf().setAppName("WCSYNC-FileCompressor-ClusterSaver");
SparkContext sc = new SparkContext(sparkConf);
JavaSparkContext context = new JavaSparkContext(sc);

// Change the "part" prefix of the output files to "myfilename".
context.hadoopConfiguration().set("mapreduce.output.basename", "myfilename");

// Save a JavaPairRDD<Text, Text> (pairRDD is a placeholder) through the new
// Hadoop output API so the basename setting above is honoured.
pairRDD.saveAsNewAPIHadoopFile(outputpath,
        Text.class,
        Text.class,
        TextOutputFormat.class,
        context.hadoopConfiguration());
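
A rough PySpark equivalent of the snippet above (my own sketch, not part of this answer) passes the same setting through the conf argument of saveAsNewAPIHadoopFile; rdd and outputpath are placeholders:

# Sketch: rename the "part" prefix to "myfilename" from PySpark.
rdd.map(lambda x: (None, x)).saveAsNewAPIHadoopFile(
    outputpath,
    "org.apache.hadoop.mapreduce.lib.output.TextOutputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.apache.hadoop.io.Text",
    conf={"mapreduce.output.basename": "myfilename"}
)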

Upvotes: -1

Daniel Darabos

Reputation: 27455

Spark also uses Hadoop under the hood, so you can probably get what you want. This is how saveAsTextFile is implemented:

def saveAsTextFile(path: String) {
  this.map(x => (NullWritable.get(), new Text(x.toString)))
    .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path)
}

You could pass in a customized OutputFormat to saveAsHadoopFile. I have no idea how to do that from Python though. Sorry for the incomplete answer.
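
One workaround from Python (my sketch, not part of this answer) is to skip custom OutputFormats entirely: write a single partition, then rename the part file through Hadoop's FileSystem API via PySpark's py4j gateway. Note that sc._jvm and sc._jsc are internal handles, not public API:

def save_as_named_file(rdd, out_dir, filename):
    # Sketch only: stage the output, then move the single part file into place.
    sc = rdd.context
    tmp_dir = out_dir + "_tmp"                # hypothetical staging directory
    rdd.coalesce(1).saveAsTextFile(tmp_dir)   # yields tmp_dir/part-00000

    hadoop = sc._jvm.org.apache.hadoop
    fs = hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
    fs.mkdirs(hadoop.fs.Path(out_dir))
    fs.rename(hadoop.fs.Path(tmp_dir + "/part-00000"),
              hadoop.fs.Path(out_dir + "/" + filename))
    fs.delete(hadoop.fs.Path(tmp_dir), True)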

Upvotes: 4
