Reputation: 575
We have found that loading data from Oracle databases with Spark's JDBC API has always been slow, from Spark 1.3 up to the current Spark 2.0.1. The typical Java code looks like this:
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Map<String, String> options = new HashMap<>();
options.put("url", ORACLE_CONNECTION_URL);
options.put("dbtable", dbTable);
options.put("batchsize", "100000");
options.put("driver", "oracle.jdbc.OracleDriver");

Dataset<Row> jdbcDF = sparkSession.read()
        .format("jdbc")
        .options(options)
        .load()
        .cache();
jdbcDF.createTempView("my");

jdbcDF.printSchema();
jdbcDF.show();
System.out.println(jdbcDF.count());
One of our team members once customized this part and improved it a lot at the time (Spark 1.3.0). But parts of the Spark core code he relied on later became internal to Spark, so his approach cannot be used in subsequent versions. We also see that Hadoop's Sqoop is much faster than Spark for this task (though it writes to HDFS, which would need a lot of work to convert into a Dataset for Spark to use). Writing to Oracle using Spark's Dataset write method performs well for us. It is puzzling why this happens!
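(For context, part of Sqoop's speed comes from reading the table with many parallel mappers. Spark's JDBC source supports similar parallelism through its partitioning options. A minimal sketch, assuming the table has a numeric column MY_ID whose value range is roughly known; the column name and bounds here are made up for illustration:)

// Split the read into 10 concurrent JDBC queries, each scanning a
// slice of MY_ID between lowerBound and upperBound (hypothetical
// column and range - adjust to your table).
Map<String, String> parallelOptions = new HashMap<>();
parallelOptions.put("url", ORACLE_CONNECTION_URL);
parallelOptions.put("dbtable", dbTable);
parallelOptions.put("driver", "oracle.jdbc.OracleDriver");
parallelOptions.put("partitionColumn", "MY_ID");
parallelOptions.put("lowerBound", "1");
parallelOptions.put("upperBound", "10000000");
parallelOptions.put("numPartitions", "10");

Dataset<Row> partitionedDF = sparkSession.read()
        .format("jdbc")
        .options(parallelOptions)
        .load();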
Upvotes: 11
Views: 4979
Reputation: 308
Well, @Pau Z Wu already answered the question in the comments, but the problem was options.put("batchsize", "100000");
This needed to be options.put("fetchsize", "100000");
since the fetch size controls how many rows the JDBC driver retrieves from the database per round trip, which makes the load faster.
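Applied to the code in the question, the fix is a one-line change (everything else stays the same):

// "fetchsize" controls how many rows the Oracle JDBC driver pulls
// per round trip on reads; "batchsize" only applies to JDBC writes.
options.put("fetchsize", "100000");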
More information can be found here: https://docs.oracle.com/cd/A87860_01/doc/java.817/a83724/resltse5.htm
Upvotes: 10