Rkey

Reputation: 700

On my Mac, Hadoop 3.1.0 finds its native libraries, but Spark 2.3.1 does not

Intro

I know that in 99% of cases the answer to this error message:

WARN  NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

is simply "It's just a warning, don't worry about it", sometimes followed by "Just download the libraries, compile them, point HADOOP_HOME to that folder and add $HADOOP_HOME/lib/native to your LD_LIBRARY_PATH".

That's what I did, but I'm still getting the error, and after two days of googling I'm starting to feel there's something really interesting to learn if I manage to fix this. There is currently some strange behaviour that I do not understand; hopefully we can work through it together.
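One way to dig deeper (a sketch, assuming Spark's stock log4j 1.x setup): turning on DEBUG logging for NativeCodeLoader makes it print the underlying UnsatisfiedLinkError rather than just the generic warning.

# Sketch: make NativeCodeLoader log WHY the load failed, not just that it did.
# Assumes $SPARK_HOME/conf/log4j.properties exists (copy it from the bundled
# log4j.properties.template first if it does not).
echo "log4j.logger.org.apache.hadoop.util.NativeCodeLoader=DEBUG" >> $SPARK_HOME/conf/log4j.properties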

Ok, so here's what's up:

Hadoop finds the native libraries

Running hadoop checknative -a gives me this:

dds-MacBook-Pro-2:~ Rkey$ hadoop checknative -a
2018-07-15 16:18:25,956 WARN bzip2.Bzip2Factory: Failed to load/initialize native-bzip2 library system-native, will use pure-Java version
2018-07-15 16:18:25,959 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
2018-07-15 16:18:25,963 WARN erasurecode.ErasureCodeNative: ISA-L support is not available in your platform... using builtin-java codec where applicable
Native library checking:
hadoop:  true /usr/local/Cellar/hadoop/3.1.0/lib/native/libhadoop.dylib
zlib:    true /usr/lib/libz.1.dylib
zstd  :  false 
snappy:  true /usr/local/lib/libsnappy.1.dylib
lz4:     true revision:10301
bzip2:   false 
openssl: false build does not support openssl.
ISA-L:   false libhadoop was built without ISA-L support
2018-07-15 16:18:25,986 INFO util.ExitUtil: Exiting with status 1: ExitException

There are some warnings and failed checks here, which might be the cause, but most importantly for now, this line is present:

hadoop:  true /usr/local/Cellar/hadoop/3.1.0/lib/native/libhadoop.dylib
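One quick sanity check on macOS (a sketch using my paths) is to list that dylib's dynamic dependencies; if any dependency is missing, the JVM's System.loadLibrary() fails even though the file itself exists:

# List the dynamic dependencies of the native library checknative found.
otool -L /usr/local/Cellar/hadoop/3.1.0/lib/native/libhadoop.dylib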

When I start my hadoop cluster, this is how it looks:

dds-MacBook-Pro-2:~ Rkey$ hstart
Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [dds-MacBook-Pro-2.local]
Starting resourcemanager
Starting nodemanagers

No warnings. I downloaded the hadoop source and built it myself. Before I did that, there were "Cannot find native library" warnings when starting hadoop.
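As far as I understand, hadoop's own launcher scripts pass the native directory straight to the JVM as java.library.path, independent of LD_LIBRARY_PATH; the commonly suggested hadoop-env.sh line below illustrates the mechanism (an illustration only, not something I added myself):

# hadoop-env.sh: hand the native dir to every hadoop JVM as a flag,
# so hadoop itself never depends on LD_LIBRARY_PATH at all.
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_HOME/lib/native"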

However, spark does not find the native libraries

This is how it looks when I run pyspark:

dds-MacBook-Pro-2:~ Rkey$ pyspark
Python 3.7.0 (default, Jun 29 2018, 20:13:53) 
[Clang 8.0.0 (clang-800.0.42.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
2018-07-15 16:22:22 WARN  NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/

This is where our old friend makes a reappearance:

WARN  NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

I find this very strange, because I know for a fact that it uses the same hadoop installation that starts without any warnings on its own. There are no other hadoop installations on my computer.
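To see which directories the Spark driver JVM actually searches, a throwaway job can print its java.library.path (a sketch; sc._jvm is PySpark's internal py4j gateway, a debugging hack rather than a stable API):

# Write a one-off script and run it through spark-submit.
cat > /tmp/libpath.py <<'EOF'
from pyspark import SparkContext

sc = SparkContext()
# Ask the driver JVM where it looks for native libraries.
print(sc._jvm.java.lang.System.getProperty("java.library.path"))
sc.stop()
EOF
spark-submit /tmp/libpath.py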

Clarifications

I downloaded the hadoop-free build of apache-spark from their website, called "Pre-built with user-provided Apache Hadoop". I then put it in my Cellar folder simply because I did not want to re-link everything.

As for environment variables, this is my ~/.profile:

export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_172.jdk/Contents/Home
export OPENSSL_ROOT_DIR=/usr/local/Cellar/openssl/1.0.2o_2

export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3
export HADOOP_HOME=/usr/local/Cellar/hadoop/3.1.0
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_HOME=/usr/local/spark

export PATH=$SPARK_HOME/bin:$PATH

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native

alias hstart="$HADOOP_HOME/sbin/start-dfs.sh;$HADOOP_HOME/sbin/start-yarn.sh"
alias hstop="$HADOOP_HOME/sbin/stop-dfs.sh;$HADOOP_HOME/sbin/stop-yarn.sh"

And here are my additions to spark-env.sh:

export SPARK_DIST_CLASSPATH=$(hadoop classpath)

export LD_LIBRARY_PATH=/usr/local/Cellar/hadoop/3.1.0/lib/native/:$LD_LIBRARY_PATH
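For completeness, Spark also has its own settings that feed the native library path to its JVMs without going through the environment; a sketch of the equivalent entries for $SPARK_HOME/conf/spark-defaults.conf (same Homebrew path as above):

# spark-defaults.conf: extend the native library path for driver and executors.
spark.driver.extraLibraryPath   /usr/local/Cellar/hadoop/3.1.0/lib/native
spark.executor.extraLibraryPath /usr/local/Cellar/hadoop/3.1.0/lib/native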

This is how the folder /usr/local/Cellar/hadoop/3.1.0/lib/native looks:

[Screenshot: output of ls -las inside /usr/local/Cellar/hadoop/3.1.0/lib/native]

The question

How is it that hadoop can start locally without warning that its native libraries are missing, and passes the hadoop checknative -a command showing that it finds them, yet when the same hadoop is launched through pyspark I am suddenly given this warning again?

Update 16/7

I recently made a discovery. The standard version of this classic error message looks like this:

WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

This is actually different: my error message says NativeCodeLoader:60, with 60 instead of 62. This supports my theory that it is not really the hadoop libraries themselves that are missing, but some native libraries that hadoop uses. That would explain why hadoop can launch without warnings, while pyspark, which likely tries to use more of hadoop's native libraries, still launches with the warning.

This is still just a theory; until I remove all the warnings from the checknative -a call, I cannot know for sure.
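One way to test this theory is to check which Hadoop build Spark's JVM actually loaded, since the line number in the warning comes from whichever NativeCodeLoader class is on the classpath (a sketch; sc._jvm is PySpark's internal py4j gateway):

cat > /tmp/hadoopver.py <<'EOF'
from pyspark import SparkContext

sc = SparkContext()
# If this prints something other than 3.1.0, pyspark is loading its Hadoop
# classes (including NativeCodeLoader) from a different Hadoop build.
print(sc._jvm.org.apache.hadoop.util.VersionInfo.getVersion())
sc.stop()
EOF
spark-submit /tmp/hadoopver.py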

Update 15/7

I am currently trying to remove the "WARN bzip2.Bzip2Factory" warning from hadoop checknative -a, in the hope that this also removes the warning when launching pyspark.

Upvotes: 2

Views: 1662

Answers (1)

Roger Lin

Reputation: 11

I had the same question. In my case, it is because, starting with OS X El Capitan, the SIP (System Integrity Protection) mechanism in macOS makes the operating system ignore LD_LIBRARY_PATH/DYLD_LIBRARY_PATH, even if you have already added the Hadoop native library path to either of these variables (I got this information from https://help.mulesoft.com/s/article/Variables-LD-LIBRARY-PATH-DYLD-LIBRARY-PATH-are-ignored-on-MAC-OS-if-System-Integrity-Protect-SIP-is-enable).

Actually, the NativeCodeLoader warning from Spark can be ignored. However, if you really want the warning to go away, you can disable SIP on macOS and then make sure $HADOOP_HOME/lib/native is on LD_LIBRARY_PATH. Spark can then find the Hadoop native library correctly.
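If you would rather not disable SIP, a workaround sketch (using the asker's Homebrew path) is to hand the directory to the driver JVM as a flag, since SIP strips the *_LIBRARY_PATH environment variables but does not touch JVM options:

# Set java.library.path directly on the driver JVM; no environment
# variable is involved, so SIP cannot strip it.
pyspark --driver-java-options "-Djava.library.path=/usr/local/Cellar/hadoop/3.1.0/lib/native"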

Upvotes: 1
