Reputation: 1316
I'm trying to run a simple GraphFrames example. I have both Python 3.6.8 and Python 2.7.15 installed, as well as Apache Maven 3.6.0, Java 1.8.0, Apache Spark 2.4.4, and Scala code runner version 2.11.12.
I got this error:
An error occurred while calling o58.loadClass.
: java.lang.ClassNotFoundException: org.graphframes.GraphFramePythonAPI
I tried to put this solution in motion, but I became stuck on step 2.
I ran pyspark --packages graphframes:graphframes:0.7.0-spark2.4-s_2.11
and got the following output:
Python 2.7.15+ (default, Jul 9 2019, 16:51:35)
[GCC 7.4.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Ivy Default Cache set to: /home/jessica/.ivy2/cache
The jars for the packages stored in: /home/jessica/.ivy2/jars
:: loading settings :: url = jar:file:/usr/local/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
graphframes#graphframes added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-1be543dc-eac1-4324-bef5-4bab70bd9c95;1.0
confs: [default]
downloading file:/home/jessica/.m2/repository/graphframes/graphframes/0.7.0-spark2.4-s_2.11/graphframes-0.7.0-spark2.4-s_2.11.jar ..
[SUCCESSFUL ] graphframes#graphframes;0.7.0-spark2.4-s_2.11!graphframes.jar (18ms)
downloading file:/home/jessica/.m2/repository/org/slf4j/slf4j-api/1.7.16/slf4j-api-1.7.16.jar ...
[SUCCESSFUL ] org.slf4j#slf4j-api;1.7.16!slf4j-api.jar (13ms)
:: resolution report :: resolve 786773ms :: artifacts dl 67ms
:: modules in use:
graphframes#graphframes;0.7.0-spark2.4-s_2.11 from local-m2-cache in [default]
org.slf4j#slf4j-api;1.7.16 from spark-list in [default]
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
| default | 2 | 1 | 1 | 0 || 2 | 2 |
:: problems summary ::
Server access error at url ( Connection timed out (Connection timed out))
Server access error at url ( Connection timed out (Connection timed out))
Server access error at url ( Connection timed out (Connection timed out))
Server access error at url ( Connection timed out (Connection timed out))
Server access error at url ( Connection timed out (Connection timed out))
Server access error at url ( Connection timed out (Connection timed out))
unknown resolver sbt-chain
unknown resolver null
:: retrieving :: org.apache.spark#spark-submit-parent-1a173e58-c356-43d7-9112-b06817ef3674
confs: [default]
2 artifacts copied, 0 already retrieved (411kB/27ms)
19/10/25 10:39:01 WARN Utils: Your hostname, jessica-VirtualBox resolves to a loopback address:; using instead (on interface enp0s3)
19/10/25 10:39:01 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
19/10/25 10:39:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Exception in thread "main" java.nio.file.NoSuchFileException: /tmp/tmp6pP3C_/
at sun.nio.fs.UnixException.translateToIOException(
at sun.nio.fs.UnixException.rethrowAsIOException(
at sun.nio.fs.UnixException.rethrowAsIOException(
at sun.nio.fs.UnixFileSystemProvider.newByteChannel(
at java.nio.file.Files.newByteChannel(
at java.nio.file.Files.createFile(
at java.nio.file.TempFileHelper.create(
at java.nio.file.TempFileHelper.createTempFile(
at java.nio.file.Files.createTempFile(
at org.apache.spark.api.python.PythonGatewayServer$.main(PythonGatewayServer.scala:70)
at org.apache.spark.api.python.PythonGatewayServer.main(PythonGatewayServer.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(
at sun.reflect.DelegatingMethodAccessorImpl.invoke(
at java.lang.reflect.Method.invoke(
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Needless to say, this is not the expected output. All of the links that timed out lead to 404s. My PC is behind a proxy, but the proxy settings are configured in the Maven settings files and I know that they work correctly.
Are there other proxy settings to change? Is there maybe another way to install these dependencies?
I changed my /usr/share/jupyter/kernels/python3/kernel.json file to:
"argv": [
"env": {
"PYSPARK_SUBMIT_ARGS": "--packages graphframes:graphframes:0.7.0-spark2.4-s_2.11 --master local[10] pyspark-shell"
"display_name": "Python 3",
"language": "python"
I then tried to run my Python script in a Jupyter Notebook. This did not work. In fact, it now causes this error as soon as I run my Python script: it crashes right after the imports have run.
I tweaked my Firefox and downloaded the files myself.
-rw-rw-r-- 1 jessica jessica 381110 Oct 22 12:17 graphframes-0.7.0-spark2.4-s_2.11.jar
-rw-rw-r-- 1 jessica jessica 2541 Oct 22 12:14 graphframes-0.7.0-spark2.4-s_2.11.pom
I then ran mvn install:install-file -Dfile=graphframes-0.7.0-spark2.4-s_2.11.jar -DpomFile=graphframes-0.7.0-spark2.4-s_2.11.pom and, though that procedure was successful, I still can't run my script (for the same reason). However, there is now a graphframes folder in my Maven repository that contains all the required files.
I have uninstalled and reinstalled Jupyter, notebook, graphframes, toree, and IPython, and have added Anaconda, all for both Python 2.7 and Python 3. I could not install the Apache Toree kernel (v0.3.0) for Python/PySpark. I do have SQL and Scala; apparently the Python/PySpark kernel is no longer supported (solutions for this are also welcome).
I have also set SPARK_HOME=~/spark/spark-2.2.0-bin-hadoop2.7 and PYSPARK_DRIVER_PYTHON="jupyter".
Upvotes: 3
Views: 2684
Reputation: 2726
Furthering jane-wayne's answer.
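from pyspark.sql import SparkSession
# Resolve GraphFrames when the session starts instead of passing --packages on the command line
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "graphframes:graphframes:0.8.1-spark2.4-s_2.11")
    .config("spark.jars.repositories", "")  # set to the repository that hosts the artifact
    .getOrCreate()
)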
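The coordinate given to spark.jars.packages encodes the GraphFrames release, the Spark release line, and the Scala version, so it has to match the local installation. The following is a minimal sketch of assembling those settings, not a definitive recipe. The helper names `graphframes_coordinate`, `session_options`, and `build_session` are hypothetical, and the repository URL is an assumption (the public Spark Packages repository), since no URL is given above.

```python
def graphframes_coordinate(graphframes_version, spark_version, scala_version):
    """Build a coordinate such as graphframes:graphframes:0.8.1-spark2.4-s_2.11."""
    # Only the major.minor part of each version appears in the coordinate.
    spark_line = ".".join(spark_version.split(".")[:2])   # "2.4.4"   -> "2.4"
    scala_line = ".".join(scala_version.split(".")[:2])   # "2.11.12" -> "2.11"
    return "graphframes:graphframes:{}-spark{}-s_{}".format(
        graphframes_version, spark_line, scala_line
    )


# Assumption: this URL does not appear in the text above.
ASSUMED_REPOSITORY = "https://repos.spark-packages.org"


def session_options(coordinate, repository=ASSUMED_REPOSITORY):
    """Return the two Spark properties used to resolve the package."""
    return {
        "spark.jars.packages": coordinate,
        "spark.jars.repositories": repository,
    }


def build_session(options):
    """Create a SparkSession with the given properties (requires pyspark)."""
    from pyspark.sql import SparkSession  # imported lazily so the helpers work without Spark

    builder = SparkSession.builder
    for key, value in options.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


options = session_options(graphframes_coordinate("0.8.1", "2.4.4", "2.11.12"))
print(options["spark.jars.packages"])
```

Passing `options` to `build_session` only has an effect if no Spark JVM is running yet in the current process, because the packages are resolved when the JVM is launched.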
Upvotes: 2
Reputation: 8855
I am seeing the same thing you are seeing. The problem is that Bintray, the repository that hosted the artifacts, shut down in May 2021. When you specify the Maven coordinates for graphframes, you should also provide the current repository that hosts this artifact.
Executing PySpark as follows worked for me.
pyspark \
--packages graphframes:graphframes:0.8.1-spark2.4-s_2.11 \
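For a Jupyter kernel, the same options can be handed to Spark through the PYSPARK_SUBMIT_ARGS environment variable, as the kernel.json in the question does. The sketch below is only an illustration under stated assumptions: `submit_args` is a hypothetical helper, and the repository URL is an assumption (the public Spark Packages repository), because the command as shown here stops after the --packages line. `--packages`, `--repositories`, and `--master` are standard spark-submit options.

```python
import os

# Assumption: this URL does not appear in the command above.
ASSUMED_REPOSITORY = "https://repos.spark-packages.org"


def submit_args(packages, repositories=(), master=None):
    """Assemble a PYSPARK_SUBMIT_ARGS string; it has to end with 'pyspark-shell'."""
    parts = []
    if packages:
        parts += ["--packages", ",".join(packages)]
    if repositories:
        parts += ["--repositories", ",".join(repositories)]
    if master:
        parts += ["--master", master]
    parts.append("pyspark-shell")
    return " ".join(parts)


args = submit_args(
    ["graphframes:graphframes:0.8.1-spark2.4-s_2.11"],
    repositories=[ASSUMED_REPOSITORY],
    master="local[10]",
)

# Set this before the first SparkSession or SparkContext is created;
# the variable is read when PySpark launches the JVM.
os.environ["PYSPARK_SUBMIT_ARGS"] = args
print(args)
```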
Upvotes: 1
Reputation: 1316
I solved the issue using advice from this site.
Long story short: put the jars straight into $SPARK_HOME/jars.
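A minimal sketch of that copy step, assuming the jars have already been downloaded into some local directory. `install_jars` is a hypothetical helper name, not part of Spark.

```python
import glob
import os
import shutil


def install_jars(jar_dir, spark_home):
    """Copy every *.jar found in jar_dir into <spark_home>/jars.

    Returns the list of destination paths that were written.
    """
    target = os.path.join(spark_home, "jars")
    if not os.path.isdir(target):
        raise IOError("no jars directory under {}".format(spark_home))

    copied = []
    for jar in sorted(glob.glob(os.path.join(jar_dir, "*.jar"))):
        destination = os.path.join(target, os.path.basename(jar))
        shutil.copy(jar, destination)
        copied.append(destination)
    return copied
```

For example, `install_jars(os.path.expanduser("~/.ivy2/jars"), os.environ["SPARK_HOME"])` would copy the jars that Ivy retrieved in the question's log. Writing into the Spark installation may require elevated permissions, and Spark has to be restarted to pick up the new jars.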
Upvotes: 1