Reputation: 700
Consider a Spark JDBC DataFrame reading from an RDBMS table, as given below:
val df = spark.read.format("jdbc").option("url", url).option("dbtable", "schema.table").option("user", user).option("password", password).load()
df.count
This count action is not recommended, since it loads the data into the Spark layer and counts it there instead of pushing the count query down to the JDBC data source. What is the efficient way to get the count in this scenario?
Upvotes: 0
Views: 1244
Reputation: 2108
Typically count will only be used once in your business logic (this is just an assumption), so the recommended way is to open a standard JDBC connection and execute a SQL statement that counts the rows. That way the count runs directly in the database rather than through Spark. Something like this might help you:
val query = "select count(*) from schema.table"
val connection = getMySqlConnection(...)  // a plain java.sql.Connection, not Spark
val rs = connection.createStatement().executeQuery(query)
rs.next()  // advance to the single result row
val count = rs.getLong(1)
connection.close()
count
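If you prefer to stay within the Spark API, a commonly used alternative is to push the aggregation down yourself by passing a subquery as the dbtable option, so the database computes the count and Spark only reads back a single row. This is a sketch, not part of the answer above; the alias cnt and the url/user/password variables are assumed to be in scope, and the exact subquery-alias syntax can vary by database (e.g. Oracle does not accept the AS keyword for table aliases):

```scala
// Sketch: let the database compute count(*) by wrapping it in a subquery.
// Spark treats the parenthesized query as the "table" to read from.
val countDf = spark.read
  .format("jdbc")
  .option("url", url)
  .option("dbtable", "(select count(*) as cnt from schema.table) as t")
  .option("user", user)
  .option("password", password)
  .load()

// The result is a single row with a single column; the numeric type
// returned for count(*) can differ per database, hence the Number cast.
val count = countDf.head().getAs[Number](0).longValue()
```

This keeps the work in the database like the raw JDBC approach, while reusing the Spark session and its connection options instead of managing a separate connection.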
Upvotes: 2