Liangliang Qiu

Reputation: 51

Sedona error: java.lang.NoClassDefFoundError: org/opengis/referencing/FactoryException

/usr/share/spark-3.0/bin/pyspark --queue=szsc \
--master=yarn \
--packages org.apache.sedona:sedona-core-3.0_2.12:1.0.0-incubating,org.apache.sedona:sedona-sql-3.0_2.12:1.0.0-incubating,org.apache.sedona:sedona-viz-3.0_2.12:1.0.0-incubating,org.apache.sedona:sedona-python-adapter-3.0_2.12:1.0.0-incubating \
--driver-memory 4g \
--num-executors 100 \
--executor-memory 8g \
--conf spark.driver.memoryOverhead=5G \
--conf spark.executor.memoryOverhead=5G

The Spark SQL query:

sql5="""
        select 
            'aoi' as type,
            b.shipment_id,
            b.order_type,
            b.sub_order_type,
            b.buyer_geo_lat,
            b.buyer_geo_lng,
            a.aoi_id as region_id,
            100 as region_level 
        from tmp_aoi_polygon_tab a, tmp_buyer_pin_tab b
        where ST_Contains(a.aoi_polygon, b.point)
"""

df5 = spark.sql(sql5)
df5.count()

Error log:

21/05/25 23:31:20 INFO FileSourceScanExec: Planning scan with bin packing, max size: 134217728 bytes, open cost is considered as scanning 4194304 bytes.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/share/spark-3.0/python/pyspark/sql/dataframe.py", line 585, in count
    return int(self._jdf.count())
  File "/usr/share/spark-3.0/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
  File "/usr/share/spark-3.0/python/pyspark/sql/utils.py", line 128, in deco
    return f(*a, **kw)
  File "/usr/share/spark-3.0/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o92.count.
: java.lang.NoClassDefFoundError: org/opengis/referencing/FactoryException
        at org.apache.spark.sql.sedona_sql.strategy.join.TraitJoinQueryExec.toSpatialRdd(TraitJoinQueryExec.scala:169)
        at org.apache.spark.sql.sedona_sql.strategy.join.TraitJoinQueryExec.toSpatialRdd$(TraitJoinQueryExec.scala:166)
        at org.apache.spark.sql.sedona_sql.strategy.join.RangeJoinExec.toSpatialRdd(RangeJoinExec.scala:37)
        at org.apache.spark.sql.sedona_sql.strategy.join.TraitJoinQueryExec.toSpatialRddPair(TraitJoinQueryExec.scala:164)
        at org.apache.spark.sql.sedona_sql.strategy.join.TraitJoinQueryExec.toSpatialRddPair$(TraitJoinQueryExec.scala:160)
        at org.apache.spark.sql.sedona_sql.strategy.join.RangeJoinExec.toSpatialRddPair(RangeJoinExec.scala:37)
        at org.apache.spark.sql.sedona_sql.strategy.join.TraitJoinQueryExec.doExecute(TraitJoinQueryExec.scala:65)
        at org.apache.spark.sql.sedona_sql.strategy.join.TraitJoinQueryExec.doExecute$(TraitJoinQueryExec.scala:56)
        at org.apache.spark.sql.sedona_sql.strategy.join.RangeJoinExec.doExecute(RangeJoinExec.scala:37)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
        at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
        at org.apache.spark.sql.execution.InputAdapter.inputRDD(WholeStageCodegenExec.scala:525)
        at org.apache.spark.sql.execution.InputRDDCodegen.inputRDDs(WholeStageCodegenExec.scala:453)
        at org.apache.spark.sql.execution.InputRDDCodegen.inputRDDs$(WholeStageCodegenExec.scala:452)
        at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:496)
        at org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:47)
        at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:720)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
        at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
        at org.apache.spark.sql.execution.columnar.CachedRDDBuilder.buildBuffers(InMemoryRelation.scala:89)
        at org.apache.spark.sql.execution.columnar.CachedRDDBuilder.cachedColumnBuffers(InMemoryRelation.scala:65)
        at org.apache.spark.sql.execution.columnar.InMemoryTableScanExec.filteredCachedBatches(InMemoryTableScanExec.scala:310)
        at org.apache.spark.sql.execution.columnar.InMemoryTableScanExec.inputRDD$lzycompute(InMemoryTableScanExec.scala:135)
        at org.apache.spark.sql.execution.columnar.InMemoryTableScanExec.inputRDD(InMemoryTableScanExec.scala:124)
        at org.apache.spark.sql.execution.columnar.InMemoryTableScanExec.doExecute(InMemoryTableScanExec.scala:341)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
        at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
        at org.apache.spark.sql.execution.InputAdapter.inputRDD(WholeStageCodegenExec.scala:525)
        at org.apache.spark.sql.execution.InputRDDCodegen.inputRDDs(WholeStageCodegenExec.scala:453)
        at org.apache.spark.sql.execution.InputRDDCodegen.inputRDDs$(WholeStageCodegenExec.scala:452)
        at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:496)
        at org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:162)
        at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:720)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
        at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
        at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.inputRDD$lzycompute(ShuffleExchangeExec.scala:106)
        at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.inputRDD(ShuffleExchangeExec.scala:106)
        at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.mapOutputStatisticsFuture$lzycompute(ShuffleExchangeExec.scala:110)
        at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.mapOutputStatisticsFuture(ShuffleExchangeExec.scala:109)
        at org.apache.spark.sql.execution.adaptive.ShuffleQueryStageExec.$anonfun$doMaterialize$1(QueryStageExec.scala:160)
        at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
        at org.apache.spark.sql.execution.adaptive.ShuffleQueryStageExec.doMaterialize(QueryStageExec.scala:160)
        at org.apache.spark.sql.execution.adaptive.QueryStageExec.$anonfun$materialize$1(QueryStageExec.scala:79)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
        at org.apache.spark.sql.execution.adaptive.QueryStageExec.materialize(QueryStageExec.scala:79)
        at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$4(AdaptiveSparkPlanExec.scala:175)
        at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$4$adapted(AdaptiveSparkPlanExec.scala:173)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$getFinalPhysicalPlan$1(AdaptiveSparkPlanExec.scala:173)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
        at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.getFinalPhysicalPlan(AdaptiveSparkPlanExec.scala:159)
        at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.executeCollect(AdaptiveSparkPlanExec.scala:255)
        at org.apache.spark.sql.Dataset.$anonfun$count$1(Dataset.scala:2981)
        at org.apache.spark.sql.Dataset.$anonfun$count$1$adapted(Dataset.scala:2980)
        at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
        at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
        at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616)
        at org.apache.spark.sql.Dataset.count(Dataset.scala:2980)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.opengis.referencing.FactoryException
        at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
        ... 87 more

Upvotes: 5

Views: 2550

Answers (3)

Ankur Pandey

Reputation: 51

I also faced a similar issue while migrating from org.datasyslab.GeoSpark to Sedona. Here is the stack trace:

Caused by: java.lang.ClassNotFoundException: org.opengis.referencing.FactoryException
    at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:476)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:594)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:527)

Spark version: 3.2.0, Scala version: 2.12.10, Java version: 8

The java.lang.ClassNotFoundException: org.opengis.referencing.FactoryException error indicates that the required GeoTools library is missing from your classpath. To fix this, you need to add the GeoTools dependency to your pom.xml.

Adding the following dependency has fixed this issue:

<dependency>
    <groupId>org.geotools</groupId>
    <artifactId>gt-referencing</artifactId>
    <version>24.1</version>
</dependency>

Here is the list of all the dependencies I used to work with Apache Sedona in my current project:

<dependency>
    <groupId>org.locationtech.jts</groupId>
    <artifactId>jts-core</artifactId>
    <version>1.18.1</version>
</dependency>
<dependency>
    <groupId>org.geotools</groupId>
    <artifactId>gt-referencing</artifactId>
    <version>24.1</version>
</dependency>
<dependency>
    <groupId>org.geotools</groupId>
    <artifactId>gt-main</artifactId>
    <version>24.1</version>
</dependency>
<dependency>
    <groupId>org.wololo</groupId>
    <artifactId>jts2geojson</artifactId>
    <version>0.14.3</version>
</dependency>
<dependency>
    <groupId>org.apache.sedona</groupId>
    <artifactId>sedona-core-3.0_2.12</artifactId>
    <version>1.1.1-incubating</version>
</dependency>
<dependency>
    <groupId>org.apache.sedona</groupId>
    <artifactId>sedona-sql-3.0_2.12</artifactId>
    <version>1.1.1-incubating</version>
</dependency>
<dependency>
    <groupId>org.apache.sedona</groupId>
    <artifactId>sedona-viz-3.0_2.12</artifactId>
    <version>1.1.1-incubating</version>
</dependency>

Upvotes: 0

jalberdi

Reputation: 69

For the Python solution, I use pyspark within a virtual env. I added the missing jars to Spark's jars directory inside the virtual env, $DIR_VIRTUAL_ENV/lib/python3.8/site-packages/pyspark/jars, as follows:

wget https://repo1.maven.org/maven2/org/datasyslab/geotools-wrapper/1.1.0-25.2/geotools-wrapper-1.1.0-25.2.jar
wget https://repo1.maven.org/maven2/org/apache/sedona/sedona-python-adapter-3.0_2.12/1.2.0-incubating/sedona-python-adapter-3.0_2.12-1.2.0-incubating.jar
wget https://repo1.maven.org/maven2/org/apache/sedona/sedona-viz-3.0_2.12/1.2.0-incubating/sedona-viz-3.0_2.12-1.2.0-incubating.jar

Alternatively, you can download them manually and place them in the aforementioned directory.

Afterwards, exit and restart the pyspark shell; no need to import anything else explicitly.

Partially based on https://sedona.apache.org/setup/databricks/.
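
If the ST_* functions still don't resolve after restarting, the apache-sedona Python package (version 1.2.0 in the environment listed below) also lets you register them explicitly. A minimal sketch, with an arbitrary app name:

from pyspark.sql import SparkSession
from sedona.register import SedonaRegistrator

spark = SparkSession.builder.appName("sedona-demo").getOrCreate()

# Registers Sedona's ST_* SQL functions (ST_Contains, ST_Point, ...)
# with this session's catalog
SedonaRegistrator.registerAll(spark)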


Actual Python environment:

anytree==2.8.0
apache-sedona==1.2.0
astroid==1.3.2
attrs==21.4.0
certifi==2021.10.8
click==8.1.2
click-plugins==1.1.1
cligj==0.7.2
cycler==0.11.0
Fiona==1.8.21
fonttools==4.32.0
geopandas==0.10.2
importlib-metadata==4.11.3
joblib==1.1.0
jts==0.0.3
kiwisolver==1.4.2
logilab-common==1.9.2
mapclassify==2.4.3
matplotlib==3.5.1
munch==2.5.0
mypy-extensions==0.4.3
networkx==2.8
numpy==1.22.3
packaging==21.3
pandas==1.4.2
Pillow==9.1.0
py2puml==0.5.4
py4j==0.10.9.3
pyarrow==7.0.0
pydoop==2.0.0
pylint==1.4.0
pypandoc==1.7.4
pyparsing==3.0.8
pyproj==3.3.0
pyspark==3.2.1
python-dateutil==2.8.2
pytz==2022.1
scikit-learn==1.0.2
scipy==1.8.0
Shapely==1.8.1.post1
six==1.16.0
threadpoolctl==3.1.0
typing-extensions==4.1.1
venv-pack==0.2.0
xlrd==2.0.1
zipp==3.7.0

Disclaimer: I don't have enough reputation to comment in answers.

Upvotes: 1

Sebastian Hanania

Reputation: 79

The same thing happened to me about two days ago, and I finally found the solution: add and import these libraries. For Scala:

"org.datasyslab" % "geotools-wrapper" % "geotools-24.1"
"org.locationtech.jts" % "jts-core" % "1.17.0"

import org.datasyslab

For pyspark, you likewise need the datasyslab geotools wrapper (which provides the ST SQL functions) and jts; one way to pull them in is sketched below.

This happens because Sedona no longer bundles the dependencies for its SQL functions; I hope this helps.
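
A minimal sketch of letting Spark resolve these at session startup via spark.jars.packages, assuming the geotools-wrapper and python-adapter coordinates from the other answers (adjust versions to match your Spark/Scala build; the app name is arbitrary):

from pyspark.sql import SparkSession

# Sketch: resolve Sedona and the datasyslab geotools-wrapper from Maven Central
spark = (
    SparkSession.builder
    .appName("sedona-demo")
    .config(
        "spark.jars.packages",
        "org.apache.sedona:sedona-python-adapter-3.0_2.12:1.2.0-incubating,"
        "org.datasyslab:geotools-wrapper:1.1.0-25.2",
    )
    .getOrCreate()
)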

Upvotes: 3
