phildeutsch

Reputation: 683

Connecting from Spark/pyspark to PostgreSQL

I've installed Spark on a Windows machine and want to use it via Spyder. After some troubleshooting, the basics seem to work:

import os

os.environ["SPARK_HOME"] = "D:\\Analytics\\Spark\\spark-1.4.0-bin-hadoop2.6"

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

spark_config = SparkConf().setMaster("local[8]")
sc = SparkContext(conf=spark_config) 
sqlContext = SQLContext(sc)

textFile = sc.textFile("D:\\Analytics\\Spark\\spark-1.4.0-bin-hadoop2.6\\README.md")
textFile.count()
textFile.filter(lambda line: "Spark" in line).count()

sc.stop()

This runs as expected. I now want to connect to a Postgres 9.3 database running on the same server. I downloaded the JDBC driver from here and put it in the folder D:\Analytics\Spark\spark_jars. I then created a new file D:\Analytics\Spark\spark-1.4.0-bin-hadoop2.6\conf\spark-defaults.conf containing this line:

spark.driver.extraClassPath        'D:\\Analytics\\Spark\\spark_jars\\postgresql-9.3-1103.jdbc41.jar'

I then ran the following code to test the connection:

import os

os.environ["SPARK_HOME"] = "D:\\Analytics\\Spark\\spark-1.4.0-bin-hadoop2.6"

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

spark_config = SparkConf().setMaster("local[8]")
sc = SparkContext(conf=spark_config) 
sqlContext = SQLContext(sc)

df = (sqlContext
    .load(source="jdbc",
          url="jdbc:postgresql://[hostname]/[database]?user=[username]&password=[password]",
          dbtable="pubs")
 )
sc.stop()

But I get the following error:

Py4JJavaError: An error occurred while calling o22.load.
: java.sql.SQLException: No suitable driver found for jdbc:postgresql://uklonana01/stonegate?user=analytics&password=pMOe8jyd
at java.sql.DriverManager.getConnection(Unknown Source)
at java.sql.DriverManager.getConnection(Unknown Source)
at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:118)
at org.apache.spark.sql.jdbc.JDBCRelation.<init>(JDBCRelation.scala:128)
at org.apache.spark.sql.jdbc.DefaultSource.createRelation(JDBCRelation.scala:113)
at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:265)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Unknown Source)

How can I check whether I've downloaded the right .jar file or where else the error might come from?

Upvotes: 6

Views: 10026

Answers (3)

alvaro nortes

Reputation: 630

Another way to connect PySpark to your PostgreSQL database:

1) Install Spark with pip: pip install pyspark

2) Download the latest version of the PostgreSQL JDBC driver from: https://jdbc.postgresql.org/download.html

3) Complete this code with your database credentials:

from __future__ import print_function
from pyspark.sql import SparkSession


def jdbc_dataset_example(spark):
    # Read the table over JDBC; replace the bracketed placeholders with your values
    df = spark.read \
        .jdbc("jdbc:postgresql://[your_db_host]:[your_db_port]/[your_db_name]",
              "[your_table]",
              properties={"user": "[your_user]", "password": "[your_password]"})

    # Register the DataFrame as a temporary view so it can be queried with SQL
    df.createOrReplaceTempView("[your_table]")

    sqlDF = spark.sql("SELECT * FROM [your_table] LIMIT 10")
    sqlDF.show()


if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("Python Spark SQL data source example") \
        .getOrCreate()
    jdbc_dataset_example(spark)
    spark.stop()

Finally, run your application with:

spark-submit --driver-class-path /path/to/your_jdbc_jar/postgresql-42.2.6.jar --jars /path/to/your_jdbc_jar/postgresql-42.2.6.jar /path/to/your_script/test_pyspark_to_postgresql.py
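A note on the two flags: roughly speaking, java.sql.DriverManager only resolves drivers visible on the driver JVM's own classpath, so --driver-class-path is what prevents the "No suitable driver" error from the question, while --jars additionally ships the jar to the executors.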

Upvotes: 1

Sheng

Reputation: 834

I tried the SPARK_CLASSPATH environment variable, but it doesn't work with Spark 1.6.

Other answers from posts like the ones below suggested adding pyspark command-line arguments, and that works:

Not able to connect to postgres using jdbc in pyspark shell

Apache Spark : JDBC connection not working

pyspark --conf spark.executor.extraClassPath=<jdbc.jar> --driver-class-path <jdbc.jar> --jars <jdbc.jar> --master <master-URL>
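If you start Spark from an IDE such as Spyder (as in the question) rather than from the pyspark shell, the same flags can be passed through the PYSPARK_SUBMIT_ARGS environment variable before the SparkContext is created. A minimal sketch, assuming the jar path from the question; note that from Spark 1.4 onwards the value must end with pyspark-shell:

import os

# Paths taken from the question -- adjust to your installation
os.environ["SPARK_HOME"] = "D:\\Analytics\\Spark\\spark-1.4.0-bin-hadoop2.6"
jdbc_jar = "D:\\Analytics\\Spark\\spark_jars\\postgresql-9.3-1103.jdbc41.jar"

# Same flags as the pyspark command above; "pyspark-shell" must come last
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--driver-class-path " + jdbc_jar +
    " --jars " + jdbc_jar +
    " pyspark-shell"
)

from pyspark import SparkContext, SparkConf

sc = SparkContext(conf=SparkConf().setMaster("local[8]"))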

Upvotes: 3

phildeutsch

Reputation: 683

Remove spark-defaults.conf and set SPARK_CLASSPATH in the environment from Python instead, like this:

os.environ["SPARK_CLASSPATH"] = 'PATH\\TO\\postgresql-9.3-1101.jdbc41.jar'
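For context, a minimal sketch of the test script from the question with this fix applied (the jar path stays a placeholder; the variable must be set before the SparkContext, and hence the JVM, is created):

import os

os.environ["SPARK_HOME"] = "D:\\Analytics\\Spark\\spark-1.4.0-bin-hadoop2.6"
# Must be set before the SparkContext starts the JVM
os.environ["SPARK_CLASSPATH"] = 'PATH\\TO\\postgresql-9.3-1101.jdbc41.jar'

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

sc = SparkContext(conf=SparkConf().setMaster("local[8]"))
sqlContext = SQLContext(sc)

df = sqlContext.load(source="jdbc",
                     url="jdbc:postgresql://[hostname]/[database]?user=[username]&password=[password]",
                     dbtable="pubs")
sc.stop()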

Upvotes: 1
