DanaMihai

Reputation: 61

How to join data from MySQL and Oracle in SparkSQL?

Is it possible in SparkSQL to join data from MySQL and Oracle databases? I tried to join them, but I ran into trouble setting multiple jars (the JDBC drivers for MySQL and Oracle) in SPARK_CLASSPATH. Here is my code:

import os
import sys


os.environ['SPARK_HOME']="/home/x/spark-1.5.2"
sys.path.append("/home/x/spark-1.5.2/python/")
try:
    from pyspark import SparkContext, SparkConf
    from pyspark.sql import SQLContext
    MYSQL_DRIVER_PATH = "/home/x/spark-1.5.2/python/lib/mysql-connector-java-5.1.38-bin.jar"
    MYSQL_CONNECTION_URL = "jdbc:mysql://192.111.333.999:3306/db?user=us&password=pasw"

    ORACLE_DRIVER_PATH = "/home/x/spark-1.5.2/python/lib/ojdbc6.jar"
    Oracle_CONNECTION_URL = "jdbc:oracle:thin:user/[email protected]:1521:xe"

    # Define Spark configuration
    conf = SparkConf()
    conf.setMaster("local")
    conf.setAppName("MySQL_Oracle_imp_exp")

    # Initialize a SparkContext and SQLContext
    sc = SparkContext(conf=conf)
    #sc.addJar(MYSQL_DRIVER_PATH)
    sqlContext = SQLContext(sc)
    ora_tmp=sqlContext.read.format('jdbc').options(
        url=Oracle_CONNECTION_URL,
        dbtable="TABLE1",
        driver="oracle.jdbc.OracleDriver"
        ).load()
    ora_tmp.show()

    tmp2=sqlContext.load(
        source="jdbc",
        path=MYSQL_DRIVER_PATH,
        url=MYSQL_CONNECTION_URL,
        dbtable="(select city,zip from TABLE2 limit 10) as tmp2",
        driver="com.mysql.jdbc.Driver")
    c_rows=tmp2.collect()
   ....
except Exception as e:
    print e
    sys.exit(1)

Could someone please help me to solve this problem? Thanks in advance :)

Upvotes: 1

Views: 1027

Answers (1)

Durga Viswanath Gadiraju

Reputation: 3956

Here are the steps you need to follow (a sketch putting them together follows the list):

  1. First point SPARK_CLASSPATH at the jar of one of the databases, say MySQL, using the command os.environ['SPARK_CLASSPATH'] = "/usr/share/java/mysql-connector-java.jar"
  2. Run the query against the MySQL database and assign the result to a DataFrame
  3. Re-register SPARK_CLASSPATH with the jar of the second database by changing the path in the command above
  4. Run the query against the second database
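
Put together, the four steps might look roughly like this in PySpark 1.5. This is only a sketch: the jar locations, connection URLs, table names and the join column are placeholders (mostly borrowed from the question), not values the answer confirms.

import os
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

# Step 1: point SPARK_CLASSPATH at the MySQL connector jar before starting Spark
os.environ['SPARK_CLASSPATH'] = "/usr/share/java/mysql-connector-java.jar"

conf = SparkConf().setMaster("local").setAppName("MySQL_Oracle_join")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# Step 2: run the query against MySQL and keep the result as a DataFrame
mysql_df = sqlContext.read.format("jdbc").options(
    url="jdbc:mysql://192.111.333.999:3306/db?user=us&password=pasw",
    dbtable="(select city, zip from TABLE2 limit 10) as tmp2",
    driver="com.mysql.jdbc.Driver").load()

# Step 3: re-register SPARK_CLASSPATH with the Oracle driver jar
os.environ['SPARK_CLASSPATH'] = "/home/x/spark-1.5.2/python/lib/ojdbc6.jar"

# Step 4: run the query against Oracle
ora_df = sqlContext.read.format("jdbc").options(
    url="jdbc:oracle:thin:user/password@oracle-host:1521:xe",  # placeholder URL
    dbtable="TABLE1",
    driver="oracle.jdbc.OracleDriver").load()

# Join the two results on a shared column (assumed here to be "zip" in both tables)
joined = mysql_df.join(ora_df, mysql_df.zip == ora_df.zip)
joined.show()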

If you run into issues because of lazy evaluation, make sure you write the first data set out to files before switching drivers, and only then proceed.
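
In the sketch above that would mean materializing the MySQL result right after step 2 and reading it back, before SPARK_CLASSPATH is switched to the Oracle jar (the parquet path here is an arbitrary example):

# Force evaluation of the MySQL query and persist it, so Spark does not
# try to re-read MySQL lazily after the classpath has been changed
mysql_df.write.parquet("/tmp/mysql_tmp2")
mysql_df = sqlContext.read.parquet("/tmp/mysql_tmp2")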

Upvotes: 1
