Mpizos Dimitris

Reputation: 5001

Using pyspark to connect to PostgreSQL

I am trying to connect to a database with pyspark and I am using the following code:

sqlctx = SQLContext(sc)
df = sqlctx.load(
    url = "jdbc:postgresql://[hostname]/[database]",
    dbtable = "(SELECT * FROM talent LIMIT 1000) as blah",
    password = "MichaelJordan",
    user =  "ScottyPippen",
    source = "jdbc",
    driver = "org.postgresql.Driver"
)

and I am getting the following error:

[screenshot of the error; per the answers below it is a java.sql.SQLException: No suitable driver stack trace]

Any idea why this is happening?

Edit: I am trying to run the code locally on my computer.

Upvotes: 36

Views: 123999

Answers (12)

Rafael

Reputation: 761

Download the PostgreSQL JDBC Driver from https://jdbc.postgresql.org/download/

Then replace the database configuration values with your own.

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.jars", "/path_to_postgresDriver/postgresql-42.2.5.jar") \
    .getOrCreate()

df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/databasename") \
    .option("dbtable", "tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .option("driver", "org.postgresql.Driver") \
    .load()

df.printSchema()

More info: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html

Upvotes: 43

Darshan Raju

Reputation: 106

Download the PostgreSQL JDBC driver from https://jdbc.postgresql.org/download.html and use the script below. Changes to make:

  1. Edit PATH_TO_JAR_FILE
  2. Save your DB credentials in an environment file and load them (see the example after the script)
  3. Query the DB with the query option, and control how many rows each round trip fetches with fetchsize

import os
from pyspark.sql import SparkSession

PATH_TO_JAR_FILE = "/home/user/Downloads/postgresql-42.3.3.jar"
spark = SparkSession \
    .builder \
    .appName("Example") \
    .config("spark.jars", PATH_TO_JAR_FILE) \
    .getOrCreate()

DB_HOST = os.environ.get("PG_HOST")
DB_PORT = os.environ.get("PG_PORT")
DB_NAME = os.environ.get("PG_DB_CLEAN")
DB_PASSWORD = os.environ.get("PG_PASSWORD")
DB_USER = os.environ.get("PG_USERNAME")

df = spark.read \
    .format("jdbc") \
    .option("url", f"jdbc:postgresql://{DB_HOST}:{DB_PORT}/{DB_NAME}") \
    .option("user", DB_USER) \
    .option("password", DB_PASSWORD) \
    .option("driver", "org.postgresql.Driver") \
    .option("query","select * from your_table") \
    .option('fetchsize',"1000") \
    .load()

df.printSchema()
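
For step 2, the environment file could look like this (all values are placeholders), sourced in the shell before running the script:

export PG_HOST=localhost
export PG_PORT=5432
export PG_DB_CLEAN=mydb
export PG_USERNAME=username
export PG_PASSWORD=secret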

Upvotes: 0

Finch Xu

Reputation: 31

I also got this error:

java.sql.SQLException: No suitable driver
 at java.sql.DriverManager.getDriver(Unknown Source)

Adding .config('spark.driver.extraClassPath', './postgresql-42.2.18.jar') to the SparkSession builder fixed it.

E.g.:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName('Python Spark Postgresql') \
    .config("spark.jars", "./postgresql-42.2.18.jar") \
    .config('spark.driver.extraClassPath', './postgresql-42.2.18.jar') \
    .getOrCreate()


df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/abc") \
    .option("dbtable", 'tablename') \
    .option("user", "postgres") \
    .option("password", "1") \
    .load()

df.printSchema()

Upvotes: 3

Vishal R

Reputation: 1379

  1. Download the PostgreSQL JDBC jar (e.g. from https://jdbc.postgresql.org/download/, as linked in other answers).
  2. Add it to the jars/ folder of your Spark installation.
  3. Restart your kernel. It should work.
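
After the restart, a minimal read like this (connection values are placeholders) confirms Spark picks the driver up from jars/:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder connection values - replace with your own
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/dbname") \
    .option("dbtable", "tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .load()

df.printSchema()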

Upvotes: 1

musiceni

Reputation: 31

Just initialize pyspark with --jars <path/to/your/jdbc.jar>

E.g.: pyspark --jars /path/Downloads/postgresql-42.2.16.jar

Then create a DataFrame as suggested in the other answers.

E.g.:

df2 = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/db") \
    .option("dbtable", "yourTableHere") \
    .option("user", "postgres") \
    .option("password", "postgres") \
    .option("driver", "org.postgresql.Driver") \
    .load()

Upvotes: 0

Florian Uhlmann

Reputation: 179

I had trouble getting a connection to the Postgres DB with the jars I had on my computer. This code solved my problem with the driver:

from pyspark.sql import SparkSession
import os

# Tell pyspark to fetch the Postgres driver from Maven before the JVM starts;
# 'pyspark-shell' must be the last token in PYSPARK_SUBMIT_ARGS
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.postgresql:postgresql:42.1.1 pyspark-shell'

spark = SparkSession \
    .builder \
    .getOrCreate()

df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/yourDBname") \
    .option("driver", "org.postgresql.Driver") \
    .option("dbtable", "yourtablename") \
    .option("user", "postgres") \
    .option("password", "***") \
    .load()

df.show()

Upvotes: 2

Galuoises

Reputation: 3293

To use pyspark from a Jupyter notebook, first open pyspark with

pyspark --driver-class-path /spark_drivers/postgresql-42.2.12.jar  --jars /spark_drivers/postgresql-42.2.12.jar

Then, in the Jupyter notebook:

import os
jardrv = os.path.expanduser("~/spark_drivers/postgresql-42.2.12.jar")  # expand ~ to an absolute path for the JVM


from pyspark.sql import SparkSession
spark = SparkSession.builder.config('spark.driver.extraClassPath', jardrv).getOrCreate()
url = 'jdbc:postgresql://127.0.0.1/dbname'
properties = {'user': 'usr', 'password': 'pswd'}
df = spark.read.jdbc(url=url, table='tablename', properties=properties)

Upvotes: 9

Derek Hill

Reputation: 6464

One approach, building on the example from the quick start guide, is described in this blog post, which shows how to add the --packages org.postgresql:postgresql:9.4.1211 argument to the spark-submit command.

This downloads the driver into the ~/.ivy2/jars directory, in my case /Users/derekhill/.ivy2/jars/org.postgresql_postgresql-9.4.1211.jar. Passing this jar as the --driver-class-path option gives the full spark-submit command:

/usr/local/Cellar/apache-spark/2.0.2/bin/spark-submit\
 --packages org.postgresql:postgresql:9.4.1211\
 --driver-class-path /Users/derekhill/.ivy2/jars/org.postgresql_postgresql-9.4.1211.jar\
 --master local[4] main.py

And in main.py:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

dataframe = spark.read.format('jdbc').options(
        # the database name and credentials are carried in the URL
        url="jdbc:postgresql://localhost/my_db?user=derekhill&password=''",
        dbtable='my_table'
    ).load()

dataframe.show()

Upvotes: 3

Chevelle

Reputation: 248

It is necessary to copy postgresql-42.1.4.jar to all nodes; in my case, I copied it into /opt/spark-2.2.0-bin-hadoop2.7/jars.

Also, I set the classpath in ~/.bashrc (export SPARK_CLASSPATH="/opt/spark-2.2.0-bin-hadoop2.7/jars"),

and it works fine in the pyspark console and in Jupyter.

Upvotes: 6

mdh

Reputation: 5563

The following worked for me with postgres on localhost:

Download the PostgreSQL JDBC Driver from https://jdbc.postgresql.org/download.html.

For the pyspark shell, use the SPARK_CLASSPATH environment variable:

$ export SPARK_CLASSPATH=/path/to/downloaded/jar
$ pyspark

For submitting a script via spark-submit use the --driver-class-path flag:

$ spark-submit --driver-class-path /path/to/downloaded/jar script.py

In the Python script, load the table as a DataFrame as follows:

from pyspark.sql import DataFrameReader

url = 'postgresql://localhost:5432/dbname'
properties = {'user': 'username', 'password': 'password'}
df = DataFrameReader(sqlContext).jdbc(
    url='jdbc:%s' % url, table='tablename', properties=properties
)

or alternatively:

df = sqlContext.read.format('jdbc').\
    options(url='jdbc:%s' % url, dbtable='tablename').\
    load()

Note that when submitting the script via spark-submit, you need to define the sqlContext.
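
For example, a minimal sketch of that setup (the app name is arbitrary):

from pyspark import SparkContext
from pyspark.sql import SQLContext

# In the pyspark shell, sc and sqlContext already exist;
# in a script run via spark-submit, create them yourself:
sc = SparkContext(appName="postgres-read")
sqlContext = SQLContext(sc)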

Upvotes: 19

Shawn Guo

Reputation: 3228

This exception means the JDBC driver is not on the driver classpath. You can pass the JDBC jar to spark-submit with the --jars parameter and also add it to the driver classpath using spark.driver.extraClassPath.
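
For example (the jar path and driver version are placeholders):

spark-submit \
 --jars /path/to/postgresql-42.2.5.jar \
 --conf spark.driver.extraClassPath=/path/to/postgresql-42.2.5.jar \
 script.py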

Upvotes: 1

MiguelPeralvo

Reputation: 877

You normally need one of the following:

  1. install the Postgres driver on your cluster,
  2. provide the Postgres driver jar from your client with the --jars option,
  3. or provide the Maven coordinates of the Postgres driver with the --packages option (see the sketch below).
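
For example, options 2 and 3 when launching the pyspark shell (jar path and driver version are placeholders):

pyspark --jars /path/to/postgresql-42.2.5.jar

pyspark --packages org.postgresql:postgresql:42.2.5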

If you detail how you are launching pyspark, we can give you more specific guidance.

Some clues/ideas:

spark-cannot-find-the-postgres-jdbc-driver

Not able to connect to postgres using jdbc in pyspark shell

Upvotes: 4
