Aviral Srivastava
Aviral Srivastava

Reputation: 4582

How to connect to Presto JDBC in PySpark?

I want to connect to Presto server using JDBC in PySpark. I followed a tutorial which is written in Java. I am trying to do the same in my Python3 code but getting an error:

: java.sql.SQLException: No suitable driver

I have tried to execute the following code:

jdbcDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:presto://my_machine_ip:8080/hive/default") \
    .option("user", "airflow") \
    .option("dbtable", "may30_1") \
    .load()

It should be noted that I am using Spark on EMR and so, spark is already provided to me.

The code from the aforementioned tutorial is:

final String JDBC_DRIVER = "com.facebook.presto.jdbc.PrestoDriver";
    final String DB_URL = "jdbc:presto://localhost:9000/catalogName/schemaName";
    //  Database credentials
    final String USER = "username";
    final String PASS = "password";
    Connection conn = null;
    Statement stmt = null;
    try {
      //Register JDBC driver
      Class.forName(JDBC_DRIVER);

Kindly notice the JDBC_DRIVER in the above code, I have not been able to deduce the corresponding assignment in Python3 i.e. in PySpark.

Nor have I added any dependency in any configuration whatsoever.

I expect to successfully connect to my presto. Right now, I am getting the following error, complete stacktrace:

Traceback (most recent call last):
  File "<stdin>", line 5, in <module>
  File "/usr/lib/spark/python/pyspark/sql/readwriter.py", line 172, in load
    return self._df(self._jreader.load())
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o90.load.
: java.sql.SQLException: No suitable driver
    at java.sql.DriverManager.getDriver(DriverManager.java:315)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$6.apply(JDBCOptions.scala:105)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$6.apply(JDBCOptions.scala:105)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:104)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:35)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

Upvotes: 2

Views: 11590

Answers (1)

Arpit Solanki
Arpit Solanki

Reputation: 9921

You need to perform the following steps for connecting presto to pyspark.

  1. Download Presto JDBC driver jar from official website and put it into spark master node.

  2. Start pyspark shell with following params:

bin/pyspark --driver-class-path com.facebook.presto.jdbc.PrestoDriver --jars /path/to/presto/jdbc/driver/jar/file
  1. Try to connect to presto using spark
jdbcDF = spark.read \
    .format("jdbc") \
    .option("driver", "com.facebook.presto.jdbc.PrestoDriver") \  <-- presto driver class
    .option("url", "jdbc:presto://<machine_ip>:8080/hive/default") \
    .option("user", "airflow") \
    .option("dbtable", "may30_1") \
    .load()

Upvotes: 4

Related Questions