Despicable me

Reputation: 700

Spark Sql JDBC to RDBMS get count of table efficiently

Consider a Spark JDBC DataFrame over an RDBMS table, as given below:

val df = spark.read.format("jdbc")
  .option("url", url)
  .option("dbtable", "schema.table")
  .option("user", user)
  .option("password", password)
  .load()
df.count()

This count action is not recommended, since it loads the data into the Spark layer and computes the count there instead of pushing the count query down to the JDBC data source. What is an efficient way to get the count in this scenario?

Upvotes: 0

Views: 1244

Answers (1)

dumitru

Reputation: 2108

Typically the count will only be needed once in your business logic (this is just an assumption), so the recommended way is to open a plain JDBC connection and execute an SQL statement that counts the rows. That way the count is computed directly in the database rather than through Spark. Something like this might help you:

    // Run the count directly on the database over a plain JDBC connection
    val query = "select count(*) from schema.table"
    val connection = getMySqlConnection(...)
    val rs = connection.createStatement().executeQuery(query)
    rs.next()                    // advance to the single result row
    val count = rs.getLong(1)    // count(*) comes back as a BIGINT
    connection.close()
    count
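If you prefer to stay within Spark, the JDBC source can also push the aggregation down to the database when the dbtable option is given a parenthesized subquery instead of a bare table name. A sketch under that approach (url, user, and password are placeholders, not values from the question):

    // Sketch: the database executes the count; Spark fetches only the
    // single result row instead of loading the whole table.
    val countDf = spark.read.format("jdbc")
      .option("url", url)
      .option("dbtable", "(select count(*) as cnt from schema.table) t")
      .option("user", user)
      .option("password", password)
      .load()

    // count(*) typically maps to a Long column
    val count = countDf.first().getLong(0)

This keeps everything in the SparkSession (no separate connection management), at the cost of a small job submission for a single-row result.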

Upvotes: 2

Related Questions