manish

Reputation: 56

Spark JDBC parallelism

I am working on a use case where I need to do a one-time offload of a JDBC data source, in my case an SAP HANA database, to HDFS/MapR-FS. We tried Sqoop initially, but the problem with Sqoop is that it depends on a primary key field, and its --split-by argument supports only a single column. We are now planning to use Spark to do the extraction instead. Going through the JDBC options available in Spark (e.g. this post: https://forums.databricks.com/questions/14963/problems-doing-parallel-read-from-jdbc.html), the partition column option also accepts only one column, whereas most SAP HANA tables have composite primary keys (multiple columns together forming the primary key).
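
For reference, the single-column partitioned read exposed by Spark's JDBC source looks roughly like the sketch below. The host, credentials, table name, and the numeric column ID are placeholders, not real HANA objects; the com.sap.db.jdbc.Driver class and jdbc:sap:// URL scheme are the ones SAP ships, but verify them against your client version.

    import org.apache.spark.sql.SparkSession

    object SingleColumnOffload {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("hana-offload").getOrCreate()

        // Partitioned JDBC read: Spark issues numPartitions parallel queries,
        // each with a WHERE range computed on the single numeric partitionColumn.
        val df = spark.read
          .format("jdbc")
          .option("url", "jdbc:sap://hana-host:30015")   // placeholder host/port
          .option("driver", "com.sap.db.jdbc.Driver")
          .option("dbtable", "MYSCHEMA.MYTABLE")         // placeholder table
          .option("user", "user")
          .option("password", "password")
          .option("partitionColumn", "ID")               // must be a single numeric column
          .option("lowerBound", "1")
          .option("upperBound", "10000000")
          .option("numPartitions", "16")
          .load()

        // Write straight out to HDFS / MapR-FS as Parquet.
        df.write.parquet("/offload/MYSCHEMA/MYTABLE")
      }
    }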

  1. How does Spark read from a JDBC source? Does it read all the data from the table and then split it into partitions in memory across the workers?

  2. If the answer to question #1 is yes, how can I specify options on the JDBC read against the SAP HANA source so that the read happens in parallel, thereby reducing OOM errors?

  3. Some of the SAP HANA tables don't even have primary keys at all, which is the main problem in bringing over the large datasets (a sketch of a key-less partitioning approach follows this list).
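
For tables with composite keys or no key at all, the JDBC reader also has an overload that takes an explicit array of predicates instead of a partition column. Below is a minimal sketch of that approach, assuming a CREATED_AT column that can be split by date range; the URL, table, and column names are placeholders for illustration only.

    import java.util.Properties
    import org.apache.spark.sql.SparkSession

    object PredicateOffload {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("hana-offload").getOrCreate()

        val props = new Properties()
        props.setProperty("user", "user")
        props.setProperty("password", "password")
        props.setProperty("driver", "com.sap.db.jdbc.Driver")

        // One partition per predicate: each string becomes the WHERE clause of one
        // parallel query, so the predicates should cover the table without overlap.
        val predicates = Array(
          "CREATED_AT <  '2016-01-01'",
          "CREATED_AT >= '2016-01-01' AND CREATED_AT < '2017-01-01'",
          "CREATED_AT >= '2017-01-01'"
        )

        val df = spark.read.jdbc(
          "jdbc:sap://hana-host:30015",  // placeholder URL
          "MYSCHEMA.MYTABLE",            // placeholder table
          predicates,
          props
        )

        df.write.parquet("/offload/MYSCHEMA/MYTABLE")
      }
    }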

Please help me form the right approach and strategy.

Thanks in advance.

Manish

Upvotes: 2

Views: 1245

Answers (1)

suj1th

Reputation: 1801

Spark SQL is capable of a limited level of predicate pushdown and column pruning optimizations when reading from a JDBC source. Given this, it is safe to say it will not read the complete data from the JDBC table into memory, although this depends a lot on the type of extraction queries you use.
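
A quick way to see this is to select only the needed columns and apply a filter before materializing anything. In a sketch like the one below (placeholder table and column names), the projection and filter are pushed into the SQL sent to HANA rather than being applied after a full table read.

    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:sap://hana-host:30015")
      .option("driver", "com.sap.db.jdbc.Driver")
      .option("dbtable", "MYSCHEMA.ORDERS")     // placeholder table
      .option("user", "user")
      .option("password", "password")
      .load()
      .select("ORDER_ID", "AMOUNT")   // column pruning: only these columns are requested
      .filter("AMOUNT > 1000")        // predicate pushdown: sent to HANA as a WHERE clause

    // df.explain(true) shows PushedFilters and the pruned column list in the physical plan.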

SAP HANA's Spark Controller provides integration between HANA and Spark. You will have to check whether it supports tables with composite primary keys and tables with no primary key at all.

Upvotes: 0
