WestCoastProjects

Reputation: 63062

Unable to initialize main class org.apache.spark.deploy.SparkSubmit when trying to run pyspark

I have a conda installation of python 3.7

$python3 --version
Python 3.7.6

pyspark was installed via pip3 install (conda does not have a native package for it).

$conda list | grep pyspark
pyspark                   2.4.5                    pypi_0    pypi

Here is what pip3 tells me:

$pip3 install pyspark
Requirement already satisfied: pyspark in ./miniconda3/lib/python3.7/site-packages (2.4.5)
Requirement already satisfied: py4j==0.10.7 in ./miniconda3/lib/python3.7/site-packages (from pyspark) (0.10.7)
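
For completeness, a quick check from the same interpreter of which pyspark it actually resolves (version and on-disk location); just an illustrative snippet:

import os
import pyspark

# Confirm which pyspark python3 picks up and where it lives on disk.
print(pyspark.__version__)                 # expected: 2.4.5
print(os.path.dirname(pyspark.__file__))   # should be under miniconda3/.../site-packages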

jdk 11 is installed:

    $java -version
    openjdk version "11.0.2" 2019-01-15
    OpenJDK Runtime Environment 18.9 (build 11.0.2+9)
    OpenJDK 64-Bit Server VM 18.9 (build 11.0.2+9, mixed mode)
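
As I understand it, the launcher script honors JAVA_HOME when it is set and otherwise falls back to whatever java is first on PATH, so here is a quick check of which Java PySpark will actually pick up (illustrative only):

import os, shutil

# Which Java will the spark-submit launcher see?
print("JAVA_HOME    =", os.environ.get("JAVA_HOME"))
print("java on PATH =", shutil.which("java"))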

When attempting to use pyspark, things do not go so well. Here is a mini test program:

from pyspark.sql import SparkSession
import os, sys
def setupSpark():
    os.environ["PYSPARK_SUBMIT_ARGS"] = "pyspark-shell"
    spark = SparkSession.builder.appName("myapp").master("local").getOrCreate()
    return spark

sp = setupSpark()
df = sp.createDataFrame({'a':[1,2,3],'b':[4,5,6]})
df.show()

That results in:

Error: Unable to initialize main class org.apache.spark.deploy.SparkSubmit Caused by: java.lang.NoClassDefFoundError: org/apache/log4j/spi/Filter

Here are the full details:

$python3 sparktest.py 
Error: Unable to initialize main class org.apache.spark.deploy.SparkSubmit
Caused by: java.lang.NoClassDefFoundError: org/apache/log4j/spi/Filter
Traceback (most recent call last):
  File "sparktest.py", line 9, in <module>
    sp = setupSpark()
  File "sparktest.py", line 6, in setupSpark
    spark = SparkSession.builder.appName("myapp").master("local").getOrCreate()
  File "/Users/steve/miniconda3/lib/python3.7/site-packages/pyspark/sql/session.py", line 173, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "/Users/steve/miniconda3/lib/python3.7/site-packages/pyspark/context.py", line 367, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/Users/steve/miniconda3/lib/python3.7/site-packages/pyspark/context.py", line 133, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
  File "/Users/steve/miniconda3/lib/python3.7/site-packages/pyspark/context.py", line 316, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "/Users/steve/miniconda3/lib/python3.7/site-packages/pyspark/java_gateway.py", line 46, in launch_gateway
    return _launch_gateway(conf)
  File "/Users/steve/miniconda3/lib/python3.7/site-packages/pyspark/java_gateway.py", line 108, in _launch_gateway
    raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number

Any pointers or info on working environment in conda would be appreciated.

Update: It may be the case that pyspark is only available from conda-forge; I only recently started using conda-forge for conda installs. Either way, it does not change the result:

conda install -c conda-forge conda-forge::pyspark

Collecting package metadata (current_repodata.json): done
Solving environment: done


# All requested packages already installed.

Re-running the code above still gives us:

Error: Unable to initialize main class org.apache.spark.deploy.SparkSubmit
Caused by: java.lang.NoClassDefFoundError: org/apache/log4j/spi/Filter
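
Since the pip/conda pyspark package normally bundles its own jars (including a log4j jar that contains the missing class), I also sanity-checked whether a stale SPARK_HOME might be pointing the launcher at a different install. This is just a diagnostic sketch, nothing conclusive:

import glob, os
import pyspark

# Is SPARK_HOME overriding the bundled Spark, and does the bundled
# jars/ directory actually contain a log4j jar?
print("SPARK_HOME =", os.environ.get("SPARK_HOME"))
jars_dir = os.path.join(os.path.dirname(pyspark.__file__), "jars")
print("log4j jars:", glob.glob(os.path.join(jars_dir, "log4j*.jar")))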

Upvotes: 4

Views: 4847

Answers (2)

Thang Pham

Reputation: 1026

The following steps are for running your mini test program in a Conda environment:

Step 1: Create and activate a new Conda environment

conda create -n test python=3.7 -y
conda activate test

Step 2: Install the latest pyspark and pandas

pip install -U pyspark pandas   # Note: I also tested pyspark version 2.4.7

Step 3: Run the mini test. (I changed the code to build the DataFrame from a pandas DataFrame instead of a dict.)

from pyspark.sql import SparkSession
import os, sys
import pandas as pd

def setupSpark():
    os.environ["PYSPARK_SUBMIT_ARGS"] = "pyspark-shell"
    spark = SparkSession.builder.appName("myapp").master("local").getOrCreate()
    return spark

sp = setupSpark()
df = sp.createDataFrame(pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}))
df.show()

Step 4: Enjoy the output

+---+---+
|  a|  b|
+---+---+
|  1|  4|
|  2|  5|
|  3|  6|
+---+---+

Java version that I used to run pyspark:

$ java -version
java version "15.0.2" 2021-01-19
Java(TM) SE Runtime Environment (build 15.0.2+7-27)
Java HotSpot(TM) 64-Bit Server VM (build 15.0.2+7-27, mixed mode, sharing)
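
If you want to double-check which Java the running session actually uses, you can ask the gateway JVM through py4j (optional, not needed for the steps above):

# Query the JVM that backs the SparkSession for its Java version.
print(sp.sparkContext._jvm.java.lang.System.getProperty("java.version"))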

Upvotes: 1

WestCoastProjects

Reputation: 63062

The following is not really an answer but rather a workaround. A real answer is still appreciated!

I was unable to get pyspark to run at all within a conda environment. Instead I fell back to a brew-installed Python 3.9 and Spark/pyspark. Here are the commands I used:

brew install python3
git -C "/usr/local/Homebrew/Library/Taps/homebrew/homebrew-cask" fetch --unshallow
brew install apache-spark
brew link apache-spark
brew link --overwrite apache-spark
brew install scala
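
To drive the brew-installed Spark from a plain python3 script (instead of the pip package), I point the interpreter at the distribution brew unpacks under libexec. This is a minimal sketch; the path below is an assumption based on the default Homebrew prefix:

import glob, os, sys

# Assumed location of the brew-installed Spark distribution.
SPARK_HOME = "/usr/local/opt/apache-spark/libexec"
os.environ["SPARK_HOME"] = SPARK_HOME

# Make the bundled PySpark and py4j importable from this interpreter.
sys.path.insert(0, os.path.join(SPARK_HOME, "python"))
sys.path.extend(glob.glob(os.path.join(SPARK_HOME, "python", "lib", "py4j-*.zip")))

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("brewcheck").master("local").getOrCreate()
spark.range(3).show()
spark.stop()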

Upvotes: 0
