Do I need to install Hadoop in order to use all aspects of Pyspark?

Question

I've installed pyspark, but have not installed any hadoop or spark version seperatly.

Apparently under Windows pyspark needs access to the winutils.exe for Hadoop for some things (e.g. writing files to disk). When pyspark wants to access the winutilis.exe it looks for it in the bin directory of the folder specified by the HADOOP_HOME environment variable (user variable). Therefore I copied the winutils.exe into the bin directory of pyspark (.\site-packages\pyspark\bin) and specified HADOOP_HOME as .\site-packages\pyspark\. This solved the problem of getting the error message: Failed to locate the winutils binary in the hadoop binary path.

However, when I start a Spark session using pyspark I still get the following warning:

WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties

Installing Hadoop and then specifying its installation directory for HADDOP_HOME did prevent the warning. Has a specific hadoop version to be installed to make pyspark work without restrictions?

Do I need to install Hadoop in order to use all aspects of Pyspark?

Answers (1)

Related Questions