java.lang.ClassNotFoundException: org.apache.spark.sql.sources.v2.DataSourceV2 for Spark 3.0.0


What are the possible ways to process data with PySpark 3.0.0 from a plain pip installation, or at least to load data, without downgrading Spark?

When I try to load Parquet or CSV datasets, I get the exception shown under Exception Message below. The Spark session initializes fine, but loading a dataset fails.

Some Information

Exception Message

Py4JJavaError: An error occurred while calling o94.csv.
: java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/v2/DataSourceV2
    at java.base/java.lang.ClassLoader.defineClass1(Native Method)
    at java.base/java.lang.ClassLoader.defineClass(
    at java.base/
    at java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(
    at java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(
    at java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(
    at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(
    at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(
    at java.base/java.lang.ClassLoader.loadClass(
    at java.base/java.lang.ClassLoader.loadClass(
    at java.base/java.lang.Class.forName0(Native Method)
    at java.base/java.lang.Class.forName(
    at java.base/java.util.ServiceLoader$LazyClassPathLookupIterator.nextProviderClass(
    at java.base/java.util.ServiceLoader$LazyClassPathLookupIterator.hasNextService(
    at java.base/java.util.ServiceLoader$LazyClassPathLookupIterator.hasNext(
    at java.base/java.util.ServiceLoader$2.hasNext(
    at java.base/java.util.ServiceLoader$3.hasNext(
    at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:43)
    at scala.collection.Iterator.foreach(Iterator.scala:941)
    at scala.collection.Iterator.foreach$(Iterator.scala:941)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
    at scala.collection.IterableLike.foreach(IterableLike.scala:74)
    at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
    at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:255)
    at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:249)
    at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
    at scala.collection.TraversableLike.filter(TraversableLike.scala:347)
    at scala.collection.TraversableLike.filter$(TraversableLike.scala:347)
    at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:644)
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:728)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:230)
    at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:705)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(
    at java.base/java.lang.reflect.Method.invoke(
    at py4j.reflection.MethodInvoker.invoke(
    at py4j.reflection.ReflectionEngine.invoke(
    at py4j.Gateway.invoke(
    at py4j.commands.AbstractCommand.invokeMethod(
    at py4j.commands.CallCommand.execute(
    at java.base/
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.sources.v2.DataSourceV2
    at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(
    at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(
    at java.base/java.lang.ClassLoader.loadClass(
    ... 45 more

I was using a standalone installation of Spark 3.1.1.

I have tried a lot of things.

I have excluded a lot of jar files.

After a lot of suffering, I decided to delete my Spark installation and install (unpack) a new one.

I don't know why... but it's working.

I had the same problem with Spark 3 and finally figured out the cause: I was including a custom jar that relied on the old DataSource V2 API.

The solution was to remove the custom jar; Spark then began working properly.
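
To find such a jar, one option is to scan the jars directory for archives that both register a data source provider and still mention the class named in the exception. The stack trace shows the failure happening while `java.util.ServiceLoader` iterates over providers inside `DataSource.lookupDataSource`, so a jar that ships a `DataSourceRegister` service file and references `org/apache/spark/sql/sources/v2/DataSourceV2` is a plausible suspect. This is only a rough sketch: the function names `scan_jar` and `find_suspect_jars` are made up for illustration, and the byte search is a heuristic, not a real class-file parser.

```python
import zipfile
from pathlib import Path

# Service file read through java.util.ServiceLoader during data source lookup.
SERVICE_FILE = "META-INF/services/org.apache.spark.sql.sources.DataSourceRegister"
# Internal (slash-separated) name of the class missing in the stack trace.
LEGACY_CLASS = b"org/apache/spark/sql/sources/v2/DataSourceV2"


def scan_jar(jar_path):
    """Return (providers, uses_legacy_api) for a single jar (hypothetical helper)."""
    providers = []
    uses_legacy_api = False
    with zipfile.ZipFile(jar_path) as jar:
        for name in jar.namelist():
            if name == SERVICE_FILE:
                text = jar.read(name).decode("utf-8", errors="replace")
                for line in text.splitlines():
                    # Drop comments and blank lines from the service file.
                    line = line.split("#", 1)[0].strip()
                    if line:
                        providers.append(line)
            elif name.endswith(".class") and not uses_legacy_api:
                # Heuristic: class names appear as plain bytes in the constant pool.
                uses_legacy_api = LEGACY_CLASS in jar.read(name)
    return providers, uses_legacy_api


def find_suspect_jars(jars_dir):
    """List (jar name, providers) for jars that register a provider and mention the old API."""
    suspects = []
    for jar_path in sorted(Path(jars_dir).glob("*.jar")):
        try:
            providers, uses_legacy_api = scan_jar(jar_path)
        except zipfile.BadZipFile:
            continue  # skip files that are not valid archives
        if providers and uses_legacy_api:
            suspects.append((jar_path.name, providers))
    return suspects
```

Calling `find_suspect_jars` on the directory that holds the jars Spark loads (for example the `jars` directory of the installation, or wherever the custom jar was added) returns the candidates to remove or rebuild against the Spark 3 API.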

Currently, I have found a workaround that lets me manipulate data through Spark's Python APIs.



# clone a specific branch 
git clone -b branch-3.0 --single-branch
## could try the following command
## git clone --branch v3.0.0

# build a Spark distribution
cd spark
./dev/ --name spark3.0.1 --pip --r --tgz -e -PR -Phive -Phive-thriftserver -Pmesos -Pyarn -Dhadoop.version=3.0.0 -DskipTests -Pkubernetes
## after changing the value of SPARK_HOME in `.bashrc_profile`
source ~/.bashrc_profile

# download the additional jars needed into the jars directory
cd ${SPARK_HOME}/assembly/target/scala-2.12/jars
curl -O
curl -O

# add the related configuration for Spark
cp ${SPARK_HOME}/conf/spark-defaults.conf.template ${SPARK_HOME}/conf/spark-defaults.conf
## add required or desired parameters to `spark-defaults.conf`
## in my case, I edited the configuration file with `vi`

# launch an interactive shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.0.1-SNAPSHOT

Using Python version 3.8.5 (default, Jul 24 2020 05:43:01)
SparkSession available as 'spark'.
## after launching, I can read parquet and csv files without the exception

After setting up everything above, add ${SPARK_HOME}/python to the PYTHONPATH environment variable, then remember to source the file that sets it (I added it to .bashrc_profile).
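
Before creating a session, it can help to confirm which copy of pyspark Python will actually import, since a pip-installed copy in site-packages can shadow the one under ${SPARK_HOME}/python. The following is a minimal sketch using only the standard library; `module_location` and `is_under` are hypothetical helper names, not part of PySpark.

```python
import importlib.util
import os


def module_location(name):
    """Return the directory (package) or file (module) that `import name` would load, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    if spec.submodule_search_locations:
        return os.path.realpath(list(spec.submodule_search_locations)[0])
    return os.path.realpath(spec.origin) if spec.origin else None


def is_under(path, root):
    """True if `path` is `root` itself or lies somewhere below it."""
    path, root = os.path.realpath(path), os.path.realpath(root)
    return os.path.commonpath([path, root]) == root


if __name__ == "__main__":
    spark_home = os.environ.get("SPARK_HOME")
    location = module_location("pyspark")
    if location is None:
        print("pyspark is not importable; check PYTHONPATH")
    elif spark_home and is_under(location, os.path.join(spark_home, "python")):
        print("pyspark comes from SPARK_HOME:", location)
    else:
        print("pyspark comes from somewhere else:", location)
```

If the reported location is outside ${SPARK_HOME}/python, the Python package and the Spark jars may come from different builds.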

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Collect the settings in a SparkConf (named `sc` here, though it is not a SparkContext)
sc = SparkConf()
threads_max = 512
connection_max = 600
sc.set("spark.driver.memory", "10g")
sc.set("spark.hadoop.fs.s3a.threads.max", threads_max)
sc.set("spark.hadoop.fs.s3a.connection.maximum", connection_max)
sc.set("spark.driver.maxResultSize", 0)
# Apply the conf and create the session (this tail of the chain is assumed)
spark = SparkSession.builder.appName("cest-la-vie")\
    .config(conf=sc).getOrCreate()
## after launching, I can read parquet and csv files without the exception


I've also attempted to make PySpark pip-installable by building it from source, but I got stuck uploading the file to testpypi because of its size. I tried this because I want the pyspark package to be present under the site-packages directory. These are the steps I attempted:

cd ${SPARK_HOME}/python
# Step 1
python3.8 -m pip install --user --upgrade setuptools wheel
# Step 2
python3.8 sdist bdist_wheel ## /opt/spark/python
# Step 3
python3.8 -m pip install --user --upgrade twine
# Step 4
python3.8 -m twine upload --repository testpypi dist/*
## have registered an account for testpypi and got a token
Uploading pyspark-3.0.1.dev0-py2.py3-none-any.whl

## stuck here
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 345M/345M [00:49<00:00, 7.33MB/s]
Received "503: first byte timeout" Package upload appears to have failed.  Retry 1 of 5


