Reputation: 5946
I'm trying to access files that I have copied to HDFS, but I can't seem to find clear guidance on how to actually connect to them. For example, I placed files in HDFS with the following command:
hdfs dfs -put ~/spark-1.4.0/XXX/YYY input
Which works fine, but now there's the issue of locating them from PySpark. The Spark documentation points to: https://spark.apache.org/docs/latest/hadoop-third-party-distributions.html
I'm using a build of Spark that matches Hadoop 2.6, but I don't see any conf files in the directory that the above link points to.
Can I access the input files directly, or do I need to do more configuration in PySpark?
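For reference, listing the destination back from the shell shows where the relative input path actually landed; the /user/&lt;username&gt; layout below is just the HDFS default and may differ on your cluster:

# List the relative path; HDFS resolves it against the current user's home
# directory, so this typically shows up as /user/<username>/input.
hdfs dfs -ls input

# Print the default filesystem URI (hdfs://host:port) that a full path
# such as hdfs://host:port/user/<username>/input would be built from.
hdfs getconf -confKey fs.defaultFS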
Upvotes: 1
Views: 2714
Reputation: 7442
Spark doesn't ship with the hadoop-site or yarn-site configuration files, since those are specific to your Hadoop installation.
You should update the spark-env.sh script to point at the configuration directory those files live in. If you can't find the hadoop-site.xml file, try running export and grepping for CONF, checking for YARN_CONF_DIR or HADOOP_CONF_DIR. If neither is set, your hdfs command has presumably still found your config, so you could always run strace on it and look for where it's loading the configuration files from.
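As a rough sketch of what that looks like in practice (the /etc/hadoop/conf path is only an example; use whatever directory actually holds your cluster's *-site.xml files):

# See whether a config directory is already exported in your environment
export | grep CONF

# If you find (or locate) one, point Spark at it in conf/spark-env.sh:
export HADOOP_CONF_DIR=/etc/hadoop/conf   # example path, adjust to your install
export YARN_CONF_DIR=/etc/hadoop/conf

# Last resort: trace the hdfs CLI and watch which config files it opens
strace -f -e open,openat hdfs dfs -ls / 2>&1 | grep 'site\.xml'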
Upvotes: 3