Reputation: 383
First, please allow me to start by saying that I am pretty new to Spark-SQL
.
I am trying to understand various Join types and strategies in Spark-Sql, I wish to be able to know about an approach to approximate the sizes of the tables (which are participating in a Join, aggregations etc) in order to estimate/tune the expected execution time by understanding what is really happening under the hood to help me to pick the Join strategy which is best suited for that scenario (in Spark-SQL through hints etc).
Of course, the table row-counts offers a good starting point, but I want to be able to estimate the sizes in terms of bytes
/KB
/MB
/GB
/TB
s, to be cognizant which table would/would not fit in memory etc) which in turn would allow me to write more efficient SQL queries by choosing the Join type/strategy etc that is best suited for that scenario.
Note : I do not have access to PySpark. We query the Glue tables thru Sql Workbench connected to the Glue catalog with a Hive jdbc driver.
Any help is appreciated.
Thanks.
Upvotes: 3
Views: 18005
Reputation: 6338
see if this helps to find the size of table-
/**
* file content
* spark-test-data.json
* --------------------
* {"id":1,"name":"abc1"}
* {"id":2,"name":"abc2"}
* {"id":3,"name":"abc3"}
*/
val fileName = "spark-test-data.json"
val path = getClass.getResource("/" + fileName).getPath
spark.catalog.createTable("df", path, "json")
.show(false)
/**
* +---+----+
* |id |name|
* +---+----+
* |1 |abc1|
* |2 |abc2|
* |3 |abc3|
* +---+----+
*/
// Collect only statistics that do not require scanning the whole table (that is, size in bytes).
spark.sql("ANALYZE TABLE df COMPUTE STATISTICS NOSCAN")
spark.sql("DESCRIBE EXTENDED df ").filter(col("col_name") === "Statistics").show(false)
/**
* +----------+---------+-------+
* |col_name |data_type|comment|
* +----------+---------+-------+
* |Statistics|68 bytes | |
* +----------+---------+-------+
*/
spark.sql("ANALYZE TABLE df COMPUTE STATISTICS")
spark.sql("DESCRIBE EXTENDED df ").filter(col("col_name") === "Statistics").show(false)
/**
* +----------+----------------+-------+
* |col_name |data_type |comment|
* +----------+----------------+-------+
* |Statistics|68 bytes, 3 rows| |
* +----------+----------------+-------+
*/
Upvotes: 3