Reputation: 63062
I have a conda installation of Python 3.7:
$ python3 --version
Python 3.7.6
pyspark was installed via pip3 install (conda does not have a native package for it):
$ conda list | grep pyspark
pyspark                   2.4.5                    pypi_0    pypi
Here is what pip3 tells me:
$ pip3 install pyspark
Requirement already satisfied: pyspark in ./miniconda3/lib/python3.7/site-packages (2.4.5)
Requirement already satisfied: py4j==0.10.7 in ./miniconda3/lib/python3.7/site-packages (from pyspark) (0.10.7)
JDK 11 is installed:
$ java -version
openjdk version "11.0.2" 2019-01-15
OpenJDK Runtime Environment 18.9 (build 11.0.2+9)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.2+9, mixed mode)
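For reference, here is a quick way to see which JAVA_HOME and SPARK_HOME (if any) the pyspark launcher will pick up. This is just a sketch; when the variables are unset, the launcher falls back to the java on PATH and to pip's bundled Spark under site-packages.
# Sketch: print the environment variables the pyspark launcher consults.
import os
print("JAVA_HOME  =", os.environ.get("JAVA_HOME"))   # unset -> the java found on PATH is used
print("SPARK_HOME =", os.environ.get("SPARK_HOME"))  # unset -> pip's bundled Spark under site-packages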
When attempting to import pyspark, things are not going so well. Here is a mini test program:
from pyspark.sql import SparkSession
import os, sys

def setupSpark():
    os.environ["PYSPARK_SUBMIT_ARGS"] = "pyspark-shell"
    spark = SparkSession.builder.appName("myapp").master("local").getOrCreate()
    return spark

sp = setupSpark()
df = sp.createDataFrame({'a':[1,2,3],'b':[4,5,6]})
df.show()
That results in:
Error: Unable to initialize main class org.apache.spark.deploy.SparkSubmit Caused by: java.lang.NoClassDefFoundError: org/apache/log4j/spi/Filter
Here are the full details:
$ python3 sparktest.py
Error: Unable to initialize main class org.apache.spark.deploy.SparkSubmit
Caused by: java.lang.NoClassDefFoundError: org/apache/log4j/spi/Filter
Traceback (most recent call last):
  File "sparktest.py", line 9, in <module>
    sp = setupSpark()
  File "sparktest.py", line 6, in setupSpark
    spark = SparkSession.builder.appName("myapp").master("local").getOrCreate()
  File "/Users/steve/miniconda3/lib/python3.7/site-packages/pyspark/sql/session.py", line 173, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "/Users/steve/miniconda3/lib/python3.7/site-packages/pyspark/context.py", line 367, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/Users/steve/miniconda3/lib/python3.7/site-packages/pyspark/context.py", line 133, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
  File "/Users/steve/miniconda3/lib/python3.7/site-packages/pyspark/context.py", line 316, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "/Users/steve/miniconda3/lib/python3.7/site-packages/pyspark/java_gateway.py", line 46, in launch_gateway
    return _launch_gateway(conf)
  File "/Users/steve/miniconda3/lib/python3.7/site-packages/pyspark/java_gateway.py", line 108, in _launch_gateway
    raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number
Any pointers or info on a working environment in conda would be appreciated.
Update: It may be the case that pyspark is available only from conda-forge; I only started to use that channel for conda install recently. But it does not change the result:
conda install -c conda-forge conda-forge::pyspark
Collecting package metadata (current_repodata.json): done
Solving environment: done
# All requested packages already installed.
Re-running the code above still gives us:
Error: Unable to initialize main class org.apache.spark.deploy.SparkSubmit
Caused by: java.lang.NoClassDefFoundError: org/apache/log4j/spi/Filter
Upvotes: 4
Views: 4847
Reputation: 1026
The following steps are for running your mini test program in a Conda environment:
Step 1: Create and activate a new Conda environment
conda create -n test python=3.7 -y
conda activate test
Step 2: Install the latest pyspark and pandas
pip install -U pyspark pandas  # Note: I also tested pyspark version 2.4.7
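Optionally, a quick sanity check that the fresh environment is the one resolving the packages (just a sketch; the versions printed are whatever pip resolved):
# Optional sanity check that the "test" environment is being used.
import sys
import pandas
import pyspark
print(sys.executable)        # should point inside the "test" conda environment
print(pyspark.__version__)   # whatever version pip resolved
print(pandas.__version__)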
Step 3: Run the mini test. (I have updated it to create the DataFrame from a pandas DataFrame instead of a dict.)
from pyspark.sql import SparkSession
import os, sys
import pandas as pd

def setupSpark():
    os.environ["PYSPARK_SUBMIT_ARGS"] = "pyspark-shell"
    spark = SparkSession.builder.appName("myapp").master("local").getOrCreate()
    return spark

sp = setupSpark()
df = sp.createDataFrame(pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}))
df.show()
Step 4: Enjoy the output
+---+---+
| a| b|
+---+---+
| 1| 4|
| 2| 5|
| 3| 6|
+---+---+
The Java version that I used with pyspark:
$ java -version
java version "15.0.2" 2021-01-19
Java(TM) SE Runtime Environment (build 15.0.2+7-27)
Java HotSpot(TM) 64-Bit Server VM (build 15.0.2+7-27, mixed mode, sharing)
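If more than one JDK is installed, JAVA_HOME can be pointed at the one you want before the session is created. A minimal sketch (the path below is only an example and will differ on your machine):
import os
# Hypothetical JDK location -- replace with your actual path (e.g. the output of /usr/libexec/java_home on macOS).
os.environ["JAVA_HOME"] = "/Library/Java/JavaVirtualMachines/jdk-15.0.2.jdk/Contents/Home"
# This must run before getOrCreate(), since the JVM is launched at that point.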
Upvotes: 1
Reputation: 63062
The following is not really an answer but instead a workaround. A real answer is still appreciated!
I was unable to get pyspark to run at all within a conda environment. Instead I backtracked to using a brew-installed Python 3.9 with spark/pyspark. Here are the commands I used:
brew install python3
git -C "/usr/local/Homebrew/Library/Taps/homebrew/homebrew-cask" fetch --unshallow
brew install apache-spark
brew link apache-spark
brew link --overwrite apache-spark
brew install scala
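To confirm the brew-based setup actually launches the JVM, a minimal smoke test like the one below can be used (assuming pyspark is importable from the brew python3, e.g. after pip3 install pyspark or with SPARK_HOME pointed at brew's apache-spark):
# Minimal smoke test for the brew-based setup.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("smoketest").master("local[1]").getOrCreate()
print(spark.version)     # the Spark version the JVM actually loaded
spark.range(3).show()    # trivial DataFrame to confirm the gateway works
spark.stop()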
Upvotes: 0