Reputation: 11827
I am trying to make Spark write to HDFS by default. Currently, when I call saveAsTextFile on an RDD, it writes to my local filesystem. Specifically, if I do this:
rdd = sc.parallelize([1, 2, 3, 4, 5])
rdd.saveAsTextFile("/tmp/sample")
it will write to /tmp/sample on my local filesystem. But, if I do
rdd = sc.parallelize([1, 2, 3, 4, 5])
rdd.saveAsTextFile("hdfs://localhost:9000/tmp/sample")
then it saves to the appropriate spot on my local HDFS instance.
Is there a way to configure or initialize Spark such that
rdd.saveAsTextFile("/tmp/sample")
will save to HDFS by default?
To answer a commenter below, when I run
hdfs getconf -confKey fs.defaultFS
I see
17/11/28 09:47:18 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
hdfs://localhost:9000
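For comparison, the value that Spark itself has resolved can be checked from the pyspark shell. This is a debugging sketch only: sc._jsc is PySpark's internal handle to the JavaSparkContext, not a stable public API.
# Prints file:/// when Spark has not picked up the Hadoop configuration,
# and hdfs://localhost:9000 once it has.
print(sc._jsc.hadoopConfiguration().get("fs.defaultFS"))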
Upvotes: 3
Views: 4230
Reputation: 1961
There are different ways of running Spark. In my case I use two: a Spark standalone installation, and Spark on YARN in a Cloudera cluster.
When I write from my Spark standalone installation, it writes to the local filesystem by default, but when I do so from Spark on YARN (Spark 2.x), HDFS is the default write location.
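Whichever deployment you are on, you can pin the target explicitly with a URI scheme instead of relying on the default. A minimal sketch, assuming the NameNode address from the question:
rdd = sc.parallelize([1, 2, 3, 4, 5])
# file:// forces the local filesystem and hdfs:// forces HDFS,
# regardless of what fs.defaultFS is set to.
rdd.saveAsTextFile("file:///tmp/sample_local")
rdd.saveAsTextFile("hdfs://localhost:9000/tmp/sample_hdfs")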
I know I am not answering your question of how to configure Spark to write to HDFS by default, and you already figured it out, but I am describing one way of deploying Spark where the default write location is HDFS.
I also believe there is a benefit to deploying Spark in a Cloudera cluster: you get useful additions such as Cloudera Manager for monitoring your resources (beyond what the Spark UI and History Server provide), log aggregation, and Hue for interacting with HDFS, Hive, and more.
Upvotes: 1
Reputation: 11827
Finally figured this out:
export HADOOP_CONF_DIR="/opt/hadoop-2.9.0/etc/hadoop/"
(or wherever Hadoop is actually installed). This is documented here:
https://spark.apache.org/docs/latest/configuration.html
The "gotcha" turned out to be that HADOOP_CONF_DIR
had to be a fully resolved path, without a ~. For a long time, I had
export HADOOP_CONF_DIR="~/opt/hadoop-2.9.0/etc/hadoop"
and that did not work correctly. Changing to an absolute path fixed the problem.
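The failure mode makes sense once you notice that the shell does not expand a ~ inside double quotes, so the variable holds a literal "~/..." string that Hadoop never resolves to a real directory. A small illustrative check in Python, using the path from the broken export above:
import os

# With export HADOOP_CONF_DIR="~/opt/hadoop-2.9.0/etc/hadoop", the ~
# survives unexpanded, so the value never names a real directory.
conf_dir = os.environ.get("HADOOP_CONF_DIR", "")
print(conf_dir)                 # ~/opt/hadoop-2.9.0/etc/hadoop
print(os.path.isdir(conf_dir))  # False: Python does not expand ~ either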
Upvotes: 5
Reputation: 180
Short answer: no. A path like "/tmp/sample" points to your local filesystem by default. What is the reason for not using rdd.saveAsTextFile("hdfs://localhost:9000/tmp/sample")?
You could, however, store the path in a variable and broadcast it to the workers if necessary, as in the sketch below.
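A minimal sketch of that suggestion, assuming the NameNode address from the question:
# Keep the base URI in one place so it is easy to change later.
base_path = "hdfs://localhost:9000/tmp"

rdd = sc.parallelize([1, 2, 3, 4, 5])
rdd.saveAsTextFile(base_path + "/sample")

# Broadcasting the string is only needed if worker-side code has to
# read it, for example inside a map function:
base_bc = sc.broadcast(base_path)
print(rdd.map(lambda x: base_bc.value + "/part_" + str(x)).take(2))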
Upvotes: -2