manish

Reputation: 56

Spark JDBC parallelism

I am working on a use case where I need to do a one-time offload of a JDBC data source, in my case an SAP HANA database, to HDFS/MapR-FS. We tried Sqoop initially, but the problem with Sqoop is that it depends on a primary key field, and its --split-by argument supports only a single column. We are now planning to use Spark to do the extraction instead. Going through the JDBC options available in Spark (e.g. this post: https://forums.databricks.com/questions/14963/problems-doing-parallel-read-from-jdbc.html), the partition column option also accepts only one column, whereas most SAP HANA tables have composite primary keys (multiple columns together forming the primary key).
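
For reference, the single-column partitioned read exposed by Spark's JDBC source looks roughly like the sketch below. The host, credentials, table name, and the numeric column ID are placeholders, not real HANA objects; the com.sap.db.jdbc.Driver class and jdbc:sap:// URL scheme are the ones SAP ships, but verify them against your client version.

    import org.apache.spark.sql.SparkSession

    object SingleColumnOffload {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("hana-offload").getOrCreate()

        // Partitioned JDBC read: Spark issues numPartitions parallel queries,
        // each with a WHERE range computed on the single numeric partitionColumn.
        val df = spark.read
          .format("jdbc")
          .option("url", "jdbc:sap://hana-host:30015")   // placeholder host/port
          .option("driver", "com.sap.db.jdbc.Driver")
          .option("dbtable", "MYSCHEMA.MYTABLE")         // placeholder table
          .option("user", "user")
          .option("password", "password")
          .option("partitionColumn", "ID")               // must be a single numeric column
          .option("lowerBound", "1")
          .option("upperBound", "10000000")
          .option("numPartitions", "16")
          .load()

        // Write straight out to HDFS / MapR-FS as Parquet.
        df.write.parquet("/offload/MYSCHEMA/MYTABLE")
      }
    }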

  1. How does Spark read from a JDBC source? Does it read all the data from the table and then split it into partitions in memory across the workers?

  2. If the answer to question #1 is yes, how can I specify options on the JDBC read against the SAP HANA source so that the read happens in parallel, thereby reducing OOM errors?

  3. Some of the SAP HANA tables don't even have primary keys at all, which is the main problem in bringing over the large datasets (a sketch of a key-less partitioning approach follows this list).
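
For tables with composite keys or no key at all, the JDBC reader also has an overload that takes an explicit array of predicates instead of a partition column. Below is a minimal sketch of that approach, assuming a CREATED_AT column that can be split by date range; the URL, table, and column names are placeholders for illustration only.

    import java.util.Properties
    import org.apache.spark.sql.SparkSession

    object PredicateOffload {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("hana-offload").getOrCreate()

        val props = new Properties()
        props.setProperty("user", "user")
        props.setProperty("password", "password")
        props.setProperty("driver", "com.sap.db.jdbc.Driver")

        // One partition per predicate: each string becomes the WHERE clause of one
        // parallel query, so the predicates should cover the table without overlap.
        val predicates = Array(
          "CREATED_AT <  '2016-01-01'",
          "CREATED_AT >= '2016-01-01' AND CREATED_AT < '2017-01-01'",
          "CREATED_AT >= '2017-01-01'"
        )

        val df = spark.read.jdbc(
          "jdbc:sap://hana-host:30015",  // placeholder URL
          "MYSCHEMA.MYTABLE",            // placeholder table
          predicates,
          props
        )

        df.write.parquet("/offload/MYSCHEMA/MYTABLE")
      }
    }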

Please help me form the right approach and strategy.

Thanks in advance.

Manish

Upvotes: 2

Views: 1245

Answers (1)

suj1th

Reputation: 1801

Spark SQL is capable of a limited level of predicate pushdown and column pruning optimizations when reading from a JDBC source. Given this, it is safe to say it will not read the complete data from the JDBC table into memory, although this depends a lot on the type of extraction queries you use.
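
A quick way to see this is to select only the needed columns and apply a filter before materializing anything. In a sketch like the one below (placeholder table and column names), the projection and filter are pushed into the SQL sent to HANA rather than being applied after a full table read.

    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:sap://hana-host:30015")
      .option("driver", "com.sap.db.jdbc.Driver")
      .option("dbtable", "MYSCHEMA.ORDERS")     // placeholder table
      .option("user", "user")
      .option("password", "password")
      .load()
      .select("ORDER_ID", "AMOUNT")   // column pruning: only these columns are requested
      .filter("AMOUNT > 1000")        // predicate pushdown: sent to HANA as a WHERE clause

    // df.explain(true) shows PushedFilters and the pruned column list in the physical plan.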

SAP HANA's Spark Controller provides integration between HANA and Spark. You will have to check whether it supports tables with composite primary keys and tables with no primary key at all.

Upvotes: 0
