Reputation: 5946
I'm trying to access files that I have copied to HDFS, but I can't seem to find clear guidance on how to actually connect to them. For example, I placed files in HDFS with the following command:
hdfs dfs -put ~/spark-1.4.0/XXX/YYY input
Which works fine, but now there's the issue of locating them from PySpark. The Spark documentation points to: https://spark.apache.org/docs/latest/hadoop-third-party-distributions.html
I'm using a build of Spark that matches Hadoop 2.6, but I don't see any conf files in the directory that the above link points to.
Can I access the input files directly, or do I need to do more configuration in PySpark?
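For reference, listing the destination back from the shell shows where the relative input path actually landed; the /user/&lt;username&gt; layout below is just the HDFS default and may differ on your cluster:

# List the relative path; HDFS resolves it against the current user's home
# directory, so this typically shows up as /user/<username>/input.
hdfs dfs -ls input

# Print the default filesystem URI (hdfs://host:port) that a full path
# such as hdfs://host:port/user/<username>/input would be built from.
hdfs getconf -confKey fs.defaultFS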
Upvotes: 1
Views: 2714
Reputation: 7442
Spark doesn't ship with the hadoop-site or yarn-site configuration files, since those are specific to your Hadoop installation.
You should update the spark-env.sh script to point at the configuration directory those files live in. If you can't find the hadoop-site.xml file, try running export and grepping for CONF, checking for YARN_CONF_DIR or HADOOP_CONF_DIR. If neither is set, your hdfs command has presumably still found your config, so you could always run strace on it and look for where it's loading the configuration files from.
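As a rough sketch of what that looks like in practice (the /etc/hadoop/conf path is only an example; use whatever directory actually holds your cluster's *-site.xml files):

# See whether a config directory is already exported in your environment
export | grep CONF

# If you find (or locate) one, point Spark at it in conf/spark-env.sh:
export HADOOP_CONF_DIR=/etc/hadoop/conf   # example path, adjust to your install
export YARN_CONF_DIR=/etc/hadoop/conf

# Last resort: trace the hdfs CLI and watch which config files it opens
strace -f -e open,openat hdfs dfs -ls / 2>&1 | grep 'site\.xml'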
Upvotes: 3