Reputation: 773
I'm studying the phases of the Catalyst optimizer, but I have some doubts about how the first three phases work in practice.
In the first phase (the analysis phase) the optimizer creates a logical plan of the query. But here the columns are unresolved, so it needs to use a catalog object to resolve them.
Doubt: Do you know how this catalog object works to solve this? For example, if we are executing queries over Hive tables, does the optimizer connect to the Hive tables in HDFS to resolve the columns?
In the second phase (logical optimization) the optimizer applies standard rules to the logical plan, like constant folding, predicate pushdown and project pruning.
Doubt: I'm trying to find examples to better understand what Spark really does in this phase, and how constant folding, predicate pushdown and project pruning help to optimize the query, but I can't find anything concrete about this.
In the third phase (physical planning) Spark takes the logical plan and generates one or more physical plans, using physical operators that match the Spark execution engine.
Doubt: Do you understand this part "using physical operators that match the Spark execution engine"?
Upvotes: 1
Views: 1119
Reputation: 330063
Do you know how this catalog object works to solve this? For example, if we are executing queries over Hive tables, does the optimizer connect to the Hive tables in HDFS to resolve the columns?
There is no single answer here. The basic catalog is SessionCatalog, which serves only as a proxy to the actual ExternalCatalog. Spark provides two different implementations of ExternalCatalog out of the box: InMemoryCatalog and HiveExternalCatalog, which correspond to the standard SQLContext and HiveContext respectively. Obviously the latter one may access the Hive metastore, but there should be no data access otherwise.
In Spark 2.0+ the catalog can be queried directly using SparkSession.catalog, for example:
import spark.implicits._  // already in scope in spark-shell; needed elsewhere for toDF

val df = Seq(("a", 1), ("b", 2)).toDF("k", "v")
// df: org.apache.spark.sql.DataFrame = [k: string, v: int]
spark.catalog.listTables
// org.apache.spark.sql.Dataset[org.apache.spark.sql.catalog.Table] =
// [name: string, database: string ... 3 more fields]
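Column metadata is resolved against the same catalog (for Hive tables that means the metastore), not by reading the data files. A minimal sketch, assuming a table named my_table exists in the current database (the name is hypothetical):

spark.catalog.listColumns("my_table").show()
// lists name, dataType, nullable, isPartition, ... as recorded in the catalog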
constant folding
This is not in any particular way specific to Catalyst. It is just a standard compilation technique, and its benefits should be obvious: it is better to compute an expression once than to repeat this for each row.
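As a minimal sketch, reusing the df defined above (the exact plan text varies between versions): the constant sub-expression 60 * 60 * 24 is folded into a single literal at optimization time instead of being recomputed for every row.

df.selectExpr("v * (60 * 60 * 24) AS seconds").explain(true)
// the Optimized Logical Plan already contains the folded literal, roughly:
// Project [(v#.. * 86400) AS seconds#..]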
predicate pushdown
Predicates correspond to the WHERE clause in a SQL query. If these can be used directly by an external system (like a relational database) or for partition pruning (like in Parquet), this means a reduced amount of data that has to be transferred / loaded from disk.
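For illustration, a sketch with a file-based source; /tmp/events is a hypothetical Parquet dataset with a year column:

val events = spark.read.parquet("/tmp/events")
events.where("year = 2016").explain()
// the physical plan records the filter handed to the reader, roughly:
// ... PushedFilters: [IsNotNull(year), EqualTo(year,2016)] ...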
and project pruning
Benefits are pretty much the same as for predicate pushdown. If some columns are not used downstream, the data source may discard them on read.
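Continuing the same hypothetical Parquet example, selecting a subset of columns narrows what the source actually reads:

spark.read.parquet("/tmp/events").select("year").explain()
// ReadSchema in the physical plan lists only the requested columns, roughly:
// ... ReadSchema: struct<year:int> ...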
Do you understand this part "using physical operators"?
DataFrame is just a high-level abstraction. Internally, operations have to be translated into basic operations on RDDs, usually some combination of mapPartitions.
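A quick way to see both layers, reusing the df from above (a sketch; operator names differ between Spark versions):

val q = df.where("v > 1").select("k")
q.explain()            // physical operators, e.g. Project, Filter, LocalTableScan
q.queryExecution.toRdd // the RDD[InternalRow] the physical plan is compiled down to
q.rdd                  // the same pipeline exposed as an RDD[Row]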
Upvotes: 3