ovnia

Reputation: 2500

Apache Spark DataFrame no RDD partitioning

According to the new Spark docs, using Spark's DataFrame should be preferred over using JdbcRDD.

The first impression was pretty enjoyable, until I hit the first problem: DataFrame has no flatMapToPair() method. My first thought was to convert it into a JavaRDD, and I did.
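For context, the conversion looks roughly like this (a minimal sketch against the Spark 1.x Java API; keying each Row by its first column is just an illustrative choice, not part of the question):

import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.sql.Row;
import scala.Tuple2;

// DataFrame itself has no flatMapToPair(), so drop down to the RDD API
JavaPairRDD<Long, Row> pairs = dataFrame
    .toJavaRDD()
    .flatMapToPair(row -> Arrays.asList(new Tuple2<Long, Row>(row.getLong(0), row)));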

Everything was fine; I wrote my code using this approach and then noticed that this code:

JavaRDD<Row> myRDD = dataFrame.toJavaRDD();
int amount = myRDD.partitions().size();

produces 1. All code after this conversion to JavaRDD is therefore completely inefficient. Forcing a repartition of the RDD takes a good deal of time and creates more overhead than code that just runs on a single partition.
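Continuing from the snippet above, the only way to get parallelism back at that point is something like this, and repartition() is a full shuffle (the target count of 10 is arbitrary):

// repartition() redistributes every row across the cluster (a full shuffle),
// so on a large table this can cost more than the work it parallelizes
JavaRDD<Row> spread = myRDD.repartition(10);
System.out.println(spread.partitions().size()); // now 10, but only after a shuffle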

How do I deal with this?

While using JdbcRDD we wrote specific SQL with a "pager" clause like WHERE id >= ? AND id <= ? that was used to create the partitions. How can I do something like this with a DataFrame?
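For reference, the old JdbcRDD pattern looks roughly like this (a sketch assuming a JavaSparkContext sc; the URL, table, and bounds are illustrative): Spark fills the two '?' placeholders with per-partition slices of the id range.

import java.sql.DriverManager;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.rdd.JdbcRDD;

// each of the 10 partitions queries its own slice of [1, 40000]
JavaRDD<Long> ids = JdbcRDD.create(
    sc,
    () -> DriverManager.getConnection("jdbc:..."),
    "SELECT id FROM my_table WHERE id >= ? AND id <= ?",
    1L, 40000L, 10,
    rs -> rs.getLong("id"));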

Upvotes: 2

Views: 776

Answers (1)

user3027745

Reputation: 24

val connectionString = "jdbc:oracle:thin:username/[email protected]:1521:ORDERS"
val ordersDF = sqlContext.load("jdbc",
  Map("url" -> connectionString,
      "dbtable" -> "(select * from CUSTOMER_ORDERS)",
      "partitionColumn" -> "ORDER_ID",
      "lowerBound" -> "1000",
      "upperBound" -> "40000",
      "numPartitions" -> "10"))

Upvotes: 1
