Pyspark read jdbc giving errors . How to fix?

I am connecting to RDS MySQL using JDBC in pyspark . I have tried almost everything that I found on Stackoverflow for debugging but still, i am unable to make it work .

spark = SparkSession.builder.config("spark.jars", mysql_jar) \
            .master("local[*]").appName("PySpark_MySQL_test").getOrCreate()
df= spark.read.format("jdbc").option("url", "jdbc:mysql://hostname.amazonaws.com:1150/dbname?user=user_name&password=password") \
            .option("driver", "com.mysql.cj.jdbc.Driver").option("dbtable", "table_name").load()

I have tried using the same connection details in pymysql library of python it connects and brings back the result.
But here I getting the below error and am unable to solve it.

raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o38.load.
: com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure

The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
    at com.mysql.cj.jdbc.exceptions.SQLError.createCommunicationsException(SQLError.java:174)
    at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:64)
    at com.mysql.cj.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:827)
    at com.mysql.cj.jdbc.ConnectionImpl.<init>(ConnectionImpl.java:447)
    at com.mysql.cj.jdbc.ConnectionImpl.getInstance(ConnectionImpl.java:237)
    at com.mysql.cj.jdbc.NonRegisteringDriver.connect(NonRegisteringDriver.java:199)
    at org.apache.spark.sql.execution.datasources.jdbc.connection.BasicConnectionProvider.getConnection(BasicConnectionProvider.scala:49)
    at org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProvider$.create(ConnectionProvider.scala:68)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$createConnectionFactory$1(JdbcUtils.scala:62)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:56)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:226)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:225)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: com.mysql.cj.exceptions.CJCommunicationsException: Communications link failure

Upvotes: 0

Answers (2)

Lijo Abraham

Reputation: 881

For anyone coming here for an answer using Docker give the below solution a try. use the below configuration

source_df = spark.read.format('jdbc').options(
        url='jdbc:mysql://host.docker.internal:3306/superset?useSSL=false&allowPublicKeyRetrieval=true',
        driver='com.mysql.cj.jdbc.Driver',
        dbtable='table',
        user='root',
        password='root').load()

I have tried the host with localhost, 127.0.0.1, and even the IPAddress from docker inspect but didn't work then changed it to host.docker.internal and it worked.

Upvotes: 0

Durant

Reputation: 11

I have experienced the same issues.Now it is worked.The core reason is spark use master node to connect mysql and use work nodes to execute task.So you can connect mysql while raise communication error.Based on this theory,you can open the security rules on mysql to let all spark node can connect to mysql

Upvotes: 1

Pyspark read jdbc giving errors . How to fix?

Answers (2)

Related Questions