djacobs7

Reputation: 11827

How do I configure pyspark to write to HDFS by default?

I am trying to make Spark write to HDFS by default. Currently, when I call saveAsTextFile on an RDD, it writes to my local filesystem. Specifically, if I do this:

rdd = sc.parallelize( [1,2,3,4,5] )
rdd.saveAsTextFile("/tmp/sample")

it will write to a file on my local filesystem called /tmp/sample. But if I do

rdd = sc.parallelize( [1,2,3,4,5] )
rdd.saveAsTextFile("hdfs://localhost:9000/tmp/sample")

then it saves to the appropriate spot on my local HDFS instance.

Is there a way to configure or initialize Spark such that

rdd.saveAsTextFile("/tmp/sample")

will save to HDFS by default?

To answer a commenter below, when I run

hdfs getconf -confKey fs.defaultFS

I see

17/11/28 09:47:18 WARN util.NativeCodeLoader: Unable to load native-hadoop   library for your platform... using builtin-java classes where applicable
hdfs://localhost:9000

Upvotes: 3

Views: 4230

Answers (3)

xmorera

Reputation: 1961

There are different ways of running Spark. In my case I use two different ones: a Spark standalone installation and Spark on YARN in a Cloudera cluster.

When I write from my Spark standalone installation, it writes to the local filesystem by default, but when I do so in Spark on YARN (it is 2.x), HDFS is the default write location.

I know I am not answering your question of how to configure Spark to write to HDFS by default, and you already figured that out, but I am describing one way to deploy Spark where the default write location is HDFS.

I also believe there is a benefit in deploying Spark in a Cloudera cluster, as you get many nice additions beyond what the Spark UI and History Server provide: Cloudera Manager for monitoring your resources, log aggregation, Hue to help you interact with HDFS, Hive, and more.

Upvotes: 1

djacobs7

Reputation: 11827

Finally figured this out:

  1. I had to create an environment variable called SPARK_CONF_DIR
  2. I created a file in there called spark-env.sh
  3. That file has a line like this: export HADOOP_CONF_DIR="/opt/hadoop-2.9.0/etc/hadoop/" (or wherever Hadoop is actually installed). This is documented here: https://spark.apache.org/docs/latest/configuration.html

The "gotcha" turned out to be that HADOOP_CONF_DIR had to be a fully resolved path, without a ~. For a long time, I had

export HADOOP_CONF_DIR="~/opt/hadoop-2.9.0/etc/hadoop" 

and that did not work correctly. Changing it to an absolute path fixed the problem.
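
To check that the Hadoop configuration was actually picked up, a minimal pyspark sketch like the following can help (it uses sc._jsc, an internal handle on the Java SparkContext, so treat it as a debugging aid rather than a stable API; the expected URI is just from my setup):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Inspect the Hadoop configuration the driver JVM is using. If spark-env.sh
# exported a correct HADOOP_CONF_DIR, this should print hdfs://localhost:9000.
print(sc._jsc.hadoopConfiguration().get("fs.defaultFS"))

# With fs.defaultFS resolving to HDFS, a scheme-less path now lands on HDFS.
rdd = sc.parallelize([1, 2, 3, 4, 5])
rdd.saveAsTextFile("/tmp/sample")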

Upvotes: 5

Georges Kohnen

Reputation: 180

Short answer: no. A path like "/tmp/sample" points to your local filesystem by default. What is the reason for not using rdd.saveAsTextFile("hdfs://localhost:9000/tmp/sample")?

You could, however, store the path in a variable and broadcast it to the workers if necessary.
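
For example, a minimal sketch of that idea (the prefix and output path are just placeholders):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Keep the HDFS prefix in one place so the scheme is not repeated by hand.
hdfs_prefix = "hdfs://localhost:9000"

# Broadcast it if code running on the workers also needs to build HDFS paths.
prefix_bc = sc.broadcast(hdfs_prefix)

rdd = sc.parallelize([1, 2, 3, 4, 5])
rdd.saveAsTextFile(hdfs_prefix + "/tmp/sample")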

Upvotes: -2
