Reputation: 1571
I've installed pyspark, but have not installed any Hadoop or Spark version separately.
Apparently, under Windows pyspark needs access to Hadoop's winutils.exe for some operations (e.g. writing files to disk). When pyspark wants to access winutils.exe, it looks for it in the bin directory of the folder specified by the HADOOP_HOME environment variable (user variable). I therefore copied winutils.exe into pyspark's bin directory (.\site-packages\pyspark\bin) and set HADOOP_HOME to .\site-packages\pyspark\. This solved the problem of getting the error message: Failed to locate the winutils binary in the hadoop binary path.
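For completeness, this is roughly how I set it up from Python before creating the session (a minimal sketch; the path is simply derived from wherever the pyspark package is installed on my machine, adjust as needed):

import os
import pyspark

# HADOOP_HOME must point at the directory whose bin\ subfolder contains winutils.exe.
# Here it is derived from the installed pyspark package location.
hadoop_home = os.path.dirname(pyspark.__file__)  # e.g. ...\site-packages\pyspark
os.environ["HADOOP_HOME"] = hadoop_home
os.environ["PATH"] = os.path.join(hadoop_home, "bin") + os.pathsep + os.environ["PATH"]

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("winutils-check").getOrCreate()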
However, when I start a Spark session using pyspark, I still get the following warning:
WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Installing Hadoop and then pointing HADOOP_HOME at its installation directory did prevent the warning. Does a specific Hadoop version have to be installed to make pyspark work without restrictions?
Upvotes: 2
Views: 2975
Reputation: 1253
A Hadoop installation is not mandatory.
Spark is a distributed computing engine only.
Spark offers only computation and has no storage of its own, but it integrates with a huge variety of storage systems such as HDFS, Cassandra, HBase, MongoDB, the local file system, etc.
Spark is designed to run on top of a variety of resource managers such as Spark Standalone, Mesos, YARN, Kubernetes, or in local mode.
PySpark is the Python API on top of Spark for developing Spark applications in Python, so a Hadoop installation is not mandatory.
Note: a Hadoop installation is only required either to run a PySpark application on top of YARN, or to read/write the application's input/output from/to HDFS/Hive/HBase, or both.
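As a minimal sketch (the app name and output path below are only illustrative), a PySpark job can run entirely in local mode against the local file system without any Hadoop installation:

from pyspark.sql import SparkSession

# Local mode: Spark itself manages the resources, no YARN/Hadoop involved.
spark = SparkSession.builder.master("local[*]").appName("no-hadoop-demo").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()

# Plain local paths (file://) work out of the box; an hdfs:// path is what
# would actually require a Hadoop/HDFS setup.
df.write.mode("overwrite").csv("file:///C:/tmp/no_hadoop_demo")

spark.stop()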
The warning you posted is a normal one, so you can ignore it.
Upvotes: 4