nojka_kruva

Reputation: 1454

Spark DataFrame not executing group-by statements within a JDBC data source

I've registered a MySQL data source as follows:

val driver = "com.mysql.jdbc.Driver"
val url = "jdbc:mysql://address=(protocol=tcp)(host=myhost)(port=3306)(user=)(password=)/dbname"

val jdbcDF = sqlContext.load("jdbc", Map(
  "url" -> url,
  "driver" -> driver,
  "dbtable" -> "videos"))

jdbcDF.registerTempTable("videos")

and then executed the following Spark SQL query:

select
   uploader, count(*) as items
from
   videos
where
   publisher_id = 154
group by
   uploader
order by
   items desc

This call actually executes the following query on the MySQL server:

SELECT uploader,publisher_id FROM videos WHERE publisher_id = 154

and then loads the data to the Spark cluster and performs the group-by as a Spark operation.

This behavior is problematic due to the excess network traffic created by not performing the group-by on the MySQL server. Is there a way to force the DataFrame to run the literal query on the MySQL server?

Upvotes: 1

Views: 469

Answers (1)

zero323

Reputation: 330093

Well, it depends. Over JDBC, Spark can only push down predicates, so it is not possible to have an arbitrary query executed on the database side dynamically. Still, it is possible to use any valid query as the table argument, so you can do something like this:

// Any valid query can be used as the table, as long as the subquery has an alias (here "tmp")
val tableQuery =
  """(SELECT uploader, count(*) AS items FROM videos GROUP BY uploader) tmp"""

val jdbcDF = sqlContext.load("jdbc", Map(
  "url" -> url,
  "driver" -> driver,
  "dbtable" -> tableQuery
))
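
For example, the whole aggregation from the question, including the filter, could be pushed down the same way. This is just a sketch reusing the table and column names from the question (videos, uploader, publisher_id); also note that row order coming back over JDBC is not guaranteed, so the ORDER BY is best done in Spark afterwards:

val fullQuery =
  """(SELECT uploader, count(*) AS items
     |FROM videos
     |WHERE publisher_id = 154
     |GROUP BY uploader) tmp""".stripMargin

val aggregatedDF = sqlContext.load("jdbc", Map(
  "url" -> url,
  "driver" -> driver,
  "dbtable" -> fullQuery
))

// The filter and group-by now run on the MySQL server; only the
// aggregated rows are shipped to the Spark cluster.
aggregatedDF.registerTempTable("videos_aggregated")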

If that's not enough, you can try to create a custom data source.

Upvotes: 2
