Reputation: 1454
I've registered a MySQL data source as follows:
val driver = "com.mysql.jdbc.Driver"
val url = "jdbc:mysql://address=(protocol=tcp)(host=myhost)(port=3306)(user=)(password=)/dbname"
val jdbcDF = sqlContext.load("jdbc", Map(
"url" -> url,
"driver" -> driver,
"dbtable" -> "videos"))
jdbcDF.registerTempTable("videos")
and then executed the following Spark SQL query:
select
uploader, count(*) as items
from
videos_table
where
publisher_id = 154
group by
uploader
order by
items desc
This call actually executes the following query on the MySQL server:
SELECT uploader,publisher_id FROM videos WHERE publisher_id = 154
and then loads the data to the Spark cluster and performs the group-by as a Spark operation.
This behavior is problematic due to the excess network traffic created by not performing the group-by on the MySQL server. Is there a way to force the DataFrame to run the literal query on the MySQL server?
Upvotes: 1
Views: 469
Reputation: 330093
Well, it depends. Spark can push-down over JDBC only the predicates so it is not possible to dynamically execute arbitrary query on a database side. Still, it is possible to use any valid query as a table
argument so you can do something like this:
val tableQuery =
"""(SELECT uploader, count(*) as items FROM videos GROUP BY uploader) tmp"""
val jdbcDF = sqlContext.load("jdbc", Map(
"url" -> url,
"driver" -> driver,
"dbtable" -> tableQuery
))
If that's not enough you can try to create a custom data source.
Upvotes: 2