ovnia

Reputation: 2500

Apache Spark DataFrame no RDD partitioning

According to the new Spark docs, using Spark's DataFrame should be preferred over using JdbcRDD.

The first impression was pretty enjoyable, until I hit the first problem: DataFrame has no flatMapToPair() method. My first thought was to convert it into a JavaRDD, and I did.
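For context, the conversion looks roughly like this (a minimal sketch against the Spark 1.x Java API; keying each Row by its first column is just an illustrative choice, not part of the question):

import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.sql.Row;
import scala.Tuple2;

// DataFrame itself has no flatMapToPair(), so drop down to the RDD API
JavaPairRDD<Long, Row> pairs = dataFrame
    .toJavaRDD()
    .flatMapToPair(row -> Arrays.asList(new Tuple2<Long, Row>(row.getLong(0), row)));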

Everything was fine; I wrote my code using this approach and then noticed that this code:

JavaRDD<Row> myRDD = dataFrame.toJavaRDD();
int amount = myRDD.partitions().size();

produces 1. All code after this conversion to JavaRDD is therefore completely inefficient. Forcing a repartition of the RDD takes a good deal of time and creates more overhead than code that just runs on a single partition.
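Continuing from the snippet above, the only way to get parallelism back at that point is something like this, and repartition() is a full shuffle (the target count of 10 is arbitrary):

// repartition() redistributes every row across the cluster (a full shuffle),
// so on a large table this can cost more than the work it parallelizes
JavaRDD<Row> spread = myRDD.repartition(10);
System.out.println(spread.partitions().size()); // now 10, but only after a shuffle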

How do I deal with this?

While using JdbcRDD we wrote specific SQL with a "pager" clause like WHERE id >= ? AND id <= ? that was used to create the partitions. How can I do something like this with a DataFrame?
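For reference, the old JdbcRDD pattern looks roughly like this (a sketch assuming a JavaSparkContext sc; the URL, table, and bounds are illustrative): Spark fills the two '?' placeholders with per-partition slices of the id range.

import java.sql.DriverManager;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.rdd.JdbcRDD;

// each of the 10 partitions queries its own slice of [1, 40000]
JavaRDD<Long> ids = JdbcRDD.create(
    sc,
    () -> DriverManager.getConnection("jdbc:..."),
    "SELECT id FROM my_table WHERE id >= ? AND id <= ?",
    1L, 40000L, 10,
    rs -> rs.getLong("id"));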

Upvotes: 2

Views: 776

Answers (1)

user3027745

Reputation: 24

val connectionString = "jdbc:oracle:thin:username/[email protected]:1521:ORDERS"
val ordersDF = sqlContext.load("jdbc",
  Map("url" -> connectionString,
      "dbtable" -> "(select * from CUSTOMER_ORDERS)",
      "partitionColumn" -> "ORDER_ID",
      "lowerBound" -> "1000",
      "upperBound" -> "40000",
      "numPartitions" -> "10"))

Upvotes: 1
