nojka_kruva

Reputation: 1454

Spark DataFrame not executing group-by statements within a JDBC data source

I've registered a MySQL data source as follows:

val driver = "com.mysql.jdbc.Driver"
val url = "jdbc:mysql://address=(protocol=tcp)(host=myhost)(port=3306)(user=)(password=)/dbname"

val jdbcDF = sqlContext.load("jdbc", Map(
  "url" -> url,
  "driver" -> driver,
  "dbtable" -> "videos"))

jdbcDF.registerTempTable("videos")

and then executed the following Spark SQL query:

select
   uploader, count(*) as items
from
   videos
where
   publisher_id = 154
group by
   uploader
order by
   items desc

This call actually executes the following query on the MySQL server:

SELECT uploader,publisher_id FROM videos WHERE publisher_id = 154

and then loads the data to the Spark cluster and performs the group-by as a Spark operation.

This behavior is problematic due to the excess network traffic created by not performing the group-by on the MySQL server. Is there a way to force the DataFrame to run the literal query on the MySQL server?

Upvotes: 1

Views: 469

Answers (1)

zero323

Reputation: 330093

Well, it depends. Over JDBC, Spark can only push down predicates, so it is not possible to have an arbitrary query executed on the database side dynamically. Still, it is possible to use any valid query as the table argument, so you can do something like this:

// Any valid query can be used as the table, as long as the subquery has an alias (here "tmp")
val tableQuery =
  """(SELECT uploader, count(*) AS items FROM videos GROUP BY uploader) tmp"""

val jdbcDF = sqlContext.load("jdbc", Map(
  "url" -> url,
  "driver" -> driver,
  "dbtable" -> tableQuery
))
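
For example, the whole aggregation from the question, including the filter, could be pushed down the same way. This is just a sketch reusing the table and column names from the question (videos, uploader, publisher_id); also note that row order coming back over JDBC is not guaranteed, so the ORDER BY is best done in Spark afterwards:

val fullQuery =
  """(SELECT uploader, count(*) AS items
     |FROM videos
     |WHERE publisher_id = 154
     |GROUP BY uploader) tmp""".stripMargin

val aggregatedDF = sqlContext.load("jdbc", Map(
  "url" -> url,
  "driver" -> driver,
  "dbtable" -> fullQuery
))

// The filter and group-by now run on the MySQL server; only the
// aggregated rows are shipped to the Spark cluster.
aggregatedDF.registerTempTable("videos_aggregated")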

If that's not enough, you can try to create a custom data source.

Upvotes: 2
