Reputation: 575
We have found that loading data from Oracle databases with Spark's JDBC API has always been slow, from Spark 1.3 up to the current Spark 2.0.1. The typical Java code looks like this:
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Map<String, String> options = new HashMap<>();
options.put("url", ORACLE_CONNECTION_URL);
options.put("dbtable", dbTable);
options.put("batchsize", "100000");
options.put("driver", "oracle.jdbc.OracleDriver");

Dataset<Row> jdbcDF = sparkSession.read()
        .format("jdbc")
        .options(options)
        .load()
        .cache();
jdbcDF.createTempView("my");

jdbcDF.printSchema();
jdbcDF.show();
System.out.println(jdbcDF.count());
One of our team members once customized this part and improved it a lot at the time (Spark 1.3.0). But parts of the Spark core code he relied on later became internal to Spark, so his approach cannot be used in subsequent versions. We also see that Hadoop's Sqoop is much faster than Spark for this task (though it writes to HDFS, which would need a lot of work to convert into a Dataset for Spark to use). Writing to Oracle using Spark's Dataset write method performs well for us. It is puzzling why this happens!
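(For context, part of Sqoop's speed comes from reading the table with many parallel mappers. Spark's JDBC source supports similar parallelism through its partitioning options. A minimal sketch, assuming the table has a numeric column MY_ID whose value range is roughly known; the column name and bounds here are made up for illustration:)

// Split the read into 10 concurrent JDBC queries, each scanning a
// slice of MY_ID between lowerBound and upperBound (hypothetical
// column and range - adjust to your table).
Map<String, String> parallelOptions = new HashMap<>();
parallelOptions.put("url", ORACLE_CONNECTION_URL);
parallelOptions.put("dbtable", dbTable);
parallelOptions.put("driver", "oracle.jdbc.OracleDriver");
parallelOptions.put("partitionColumn", "MY_ID");
parallelOptions.put("lowerBound", "1");
parallelOptions.put("upperBound", "10000000");
parallelOptions.put("numPartitions", "10");

Dataset<Row> partitionedDF = sparkSession.read()
        .format("jdbc")
        .options(parallelOptions)
        .load();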
Upvotes: 11
Views: 4979
Reputation: 308
Well, @Pau Z Wu already answered the question in the comments, but the problem was options.put("batchsize", "100000");
This needed to be options.put("fetchsize", "100000");
since the fetch size controls how many rows the JDBC driver retrieves from the database per round trip, which makes the load faster.
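Applied to the code in the question, the fix is a one-line change (everything else stays the same):

// "fetchsize" controls how many rows the Oracle JDBC driver pulls
// per round trip on reads; "batchsize" only applies to JDBC writes.
options.put("fetchsize", "100000");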
More information can be found here: https://docs.oracle.com/cd/A87860_01/doc/java.817/a83724/resltse5.htm
Upvotes: 10