Reputation: 2500
According to the new Spark docs, using Spark's DataFrame should be preferred over JdbcRDD.

My first experience was quite pleasant, until I hit the first problem: DataFrame has no flatMapToPair() method. My first thought was to convert it into a JavaRDD, and so I did.
Everything was fine; I wrote my code using this approach and then noticed that code like this:

JavaRDD<Row> myRDD = dataFrame.toJavaRDD();
int amount = myRDD.partitions().size();

produces 1. All the code after such a conversion to JavaRDD is hopelessly inefficient, because everything runs in a single partition. Forcing a repartition of the RDD takes a good chunk of time and creates a bigger overhead than code that simply works with 1 partition.
How do I deal with this? With JdbcRDD we wrote specific SQL with a "pager", like WHERE id >= ? AND id <= ?, that was used to create the partitions. How can I do something similar with a DataFrame?
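For reference, the manual "pager" approach described above can be sketched in plain Java: split the id range into contiguous sub-ranges, one per intended partition, each becoming one WHERE clause. The RangePager class and its bounds are illustrative, not part of any Spark API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits [lower, upper] into n contiguous id ranges,
// each of which would back one "WHERE id >= ? AND id <= ?" partition query.
public class RangePager {
    public static List<long[]> splitRange(long lower, long upper, int n) {
        List<long[]> ranges = new ArrayList<>();
        long size = upper - lower + 1;
        for (int i = 0; i < n; i++) {
            long start = lower + i * size / n;
            long end = lower + (i + 1) * size / n - 1;
            ranges.add(new long[] {start, end});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Example: ids 1..100 split over 4 partitions
        for (long[] r : splitRange(1, 100, 4)) {
            System.out.println("WHERE id >= " + r[0] + " AND id <= " + r[1]);
        }
        // prints ranges 1-25, 26-50, 51-75, 76-100
    }
}
```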
Upvotes: 2
Views: 776
Reputation: 24
val connectionString = "jdbc:oracle:thin:username/[email protected]:1521:ORDERS"
val ordersDF = sqlContext.load("jdbc",
  Map("url" -> connectionString,
      "dbtable" -> "(select * from CUSTOMER_ORDERS)",
      "partitionColumn" -> "ORDER_ID",
      "lowerBound" -> "1000",
      "upperBound" -> "40000",
      "numPartitions" -> "10"))
Upvotes: 1