Reputation: 2995
I want to query a PostgreSQL with pyspark within a jupyter notebook. I have browsed a lot of questions on StackOverflow but none of them worked for me, mainly because the answers seemed outdated. Here's my minimal code:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
url = 'jdbc:postgresql://host/dbname'
properties = {'user': 'username', 'password': 'pwd'}
df = spark.read.jdbc(url=url, table='tablename', properties=properties)
Running this from a notebook would raise the following error:
Py4JJavaError: An error occurred while calling o69.jdbc.
: java.sql.SQLException: No suitable driver
at java.sql.DriverManager.getDriver(DriverManager.java:315)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:85)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:85)
at scala.Option.getOrElse(Option.scala:121)...
The principal tips I have found were summed up in the link below but unfortunately I can't get them to work in my notebook:
Pyspark connection to Postgres database in ipython notebook
Note: I am using Spark 2.3.1 and Python 3.6.3 and I am able to connect to the database from the pyspark shell if I specify the jar location.
pyspark --driver-class-path /home/.../postgresql.jar --jars /home/.../jars/postgresql.jar
Thanks to anyone who can help me on this one.
EDIT
The answers from How to load jar dependenices in IPython Notebook are already listed in the link I shared myself, and do not work for me. I already tried to configure the environment variable from the notebook:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-class-path /path/to/postgresql.jar --jars /path/to/postgresql.jar'
There's nothing wrong with the file path or the file itself since it works fine when I specify it and run the pyspark-shell.
Upvotes: 10
Views: 10241
Reputation: 2995
Using the config
method worked for me:
from pyspark.sql import SparkSession
spark = SparkSession.builder.config('spark.driver.extraClassPath', '/path/to/postgresql.jar').getOrCreate()
url = 'jdbc:postgresql://host/dbname'
properties = {'user': 'username', 'password': 'pwd'}
df = spark.read.jdbc(url=url, table='tablename', properties=properties)
Upvotes: 14