Reputation: 1493

java.lang.ClassNotFoundException: org.apache.spark.sql.sources.v2.DataSourceV2 for Spark 3.0.0

Brief

What are possible paths that can make me process data by pyspark 3.0.0 with success from the pure pip installation, well, at least loading data without downgrading the version of Spark?

When I attempted to load datasets of parquet and csv, I would get the exception message as the content below Exception Message displays. The initialization of Spark session is fine, yet when I wanted to load datasets, it just went wrong.

Some Information

Java: openjdk 11
Python: 3.8.5
Mode: local mode
Operating System: Ubuntu 16.04.6 LTS
Notes:
1. I executed python3.8 -m pip install pyspark to install Spark.
2. When I looked up the jar of spark-sql_2.12-3.0.0.jar (which is under the Python site-package path, i.e., ~/.local/lib/python3.8/site-packages/pyspark/jars in my case), there is no v2 under spark.sql.sources, the most similar one I found is an interface called DatSourceRegister under the same package.
3. The most similar question I found on Stackoverflow is PySpark structured Streaming + Kafka Error (Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.sources.v2.StreamWriteSupport ) where downgrading the Spark version is recommended throughout the information on that page.

Exception Message

Py4JJavaError: An error occurred while calling o94.csv.
: java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/v2/DataSourceV2
    at java.base/java.lang.ClassLoader.defineClass1(Native Method)
    at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1016)
    at java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:174)
    at java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:800)
    at java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:698)
    at java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:621)
    at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:579)
    at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:575)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
    at java.base/java.lang.Class.forName0(Native Method)
    at java.base/java.lang.Class.forName(Class.java:398)
    at java.base/java.util.ServiceLoader$LazyClassPathLookupIterator.nextProviderClass(ServiceLoader.java:1209)
    at java.base/java.util.ServiceLoader$LazyClassPathLookupIterator.hasNextService(ServiceLoader.java:1220)
    at java.base/java.util.ServiceLoader$LazyClassPathLookupIterator.hasNext(ServiceLoader.java:1264)
    at java.base/java.util.ServiceLoader$2.hasNext(ServiceLoader.java:1299)
    at java.base/java.util.ServiceLoader$3.hasNext(ServiceLoader.java:1384)
    at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:43)
    at scala.collection.Iterator.foreach(Iterator.scala:941)
    at scala.collection.Iterator.foreach$(Iterator.scala:941)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
    at scala.collection.IterableLike.foreach(IterableLike.scala:74)
    at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
    at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:255)
    at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:249)
    at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
    at scala.collection.TraversableLike.filter(TraversableLike.scala:347)
    at scala.collection.TraversableLike.filter$(TraversableLike.scala:347)
    at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:644)
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:728)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:230)
    at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:705)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.sources.v2.DataSourceV2
    at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
    at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
    ... 45 more

Upvotes: 3

Answers (3)

Pastor

Reputation: 318

I was using a standalone installation of Spark 3.1.1.

I have tried a lot of things.

I have excluded a lot of jar files.

After a lot of suffering, I decided to delete my Spark installation and install(unpack) a new one.

I don't know why... but it's working.

Upvotes: 0

bricard

Reputation: 26

I had this same problem with spark 3 and finally figured out the cause. I was including a custom jar that relied on the old datasource v2 api.

The solution was to remove the custom jar then spark began working properly.

Upvotes: 1

Scott Hsieh

Reputation: 1493

currently, I have got a way out for manipulating data via Python function APIs for Spark.

workaround

# clone a specific branch 
git clone -b branch-3.0 --single-branch https://github.com/apache/spark.git
## could try the follwoing command
## git clone --branch v3.0.0 https://github.com/apache/spark.git

# build a Spark distribution
cd spark
./dev/make-distribution.sh --name spark3.0.1 --pip --r --tgz -e -PR -Phive -Phive-thriftserver -Pmesos -Pyarn -Dhadoop.version=3.0.0 -DskipTests -Pkubernetes
## after changing the value of SPARK_HOME in `.bashrc_profile`
source ~/.bashrc_profile

# downlaod needed additional jars into the directory
cd ${SPARK_HOME}/assembly/target/scala-2.12/jars
curl -O https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.0.0/hadoop-aws-3.0.0.jar
curl -O https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.828/aws-java-sdk-bundle-1.11.828.jar
cd ${SPARK_HOME}

# add related configuraionts for Spark
cp ${SPARK_HOME}/conf/spark-defaults.conf.template ${SPARK_HOME}/conf/spark-defaults.conf
## add required or desired parameters into the `spark-defaults.conf`
## as of me, I edited the configuraion file by `vi`

# launch an interactive shell
pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.0.1-SNAPSHOT
      /_/

Using Python version 3.8.5 (default, Jul 24 2020 05:43:01)
SparkSession available as 'spark'.
## after launching, I can read parquet and csv files without the exception

2
after setting up all the stuff mentioned above, add ${SPARK_HOME}/python to the environment variable PYTHONPATH, then remember to source the related file (I added it into .bashrc_profile).

from pyspark import SparkConf
from pyspark.sql import SparkSession
sc = SparkConf()
threads_max = 512
connection_max = 600
sc.set("spark.driver.memory", "10g")
sc.set('spark.hadoop.fs.s3a.threads.max', threads_max)
sc.set('spark.hadoop.fs.s3a.connection.maximum', connection_max)
sc.set('spark.hadoop.fs.s3a.aws.credentials.provider',
           'com.amazonaws.auth.EnvironmentVariableCredentialsProvider')
sc.set('spark.driver.maxResultSize', 0)
spark = SparkSession.builder.appName("cest-la-vie")\
    .master("local[*]").config(conf=sc).getOrCreate()
## after launching, I can read parquet and csv files without the exception

notes

I've also attempted to make PySpark pip installable from the sources' building, but I was stuck on the uploading file size to testpypi. This trying is that I want the pyspark package to be present under the site package directory. The following is my attempting steps:

cd ${SPARK_HOME}/python
# Step 1
python3.8 -m pip install --user --upgrade setuptools wheel
# Step 2
python3.8 setup.py sdist bdist_wheel ## /opt/spark/python
# Step 3
python3.8 -m pip install --user --upgrade twine
# Step 4
python3.8 -m twine upload --repository testpypi dist/*
## have registered an account for testpypi and got a token
Uploading pyspark-3.0.1.dev0-py2.py3-none-any.whl

## stuck here
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 345M/345M [00:49<00:00, 7.33MB/s]
Received "503: first byte timeout" Package upload appears to have failed.  Retry 1 of 5