Reputation: 241
I am trying to develop a backend module which requires me to run several SQL updates on a DataFrame backed by Parquet files in HDFS. What I am interested in knowing is how multiple SQL updates affect the DataFrame's RDD lineage, and also whether it would be a concern to execute several frequent SQL updates on a DataFrame, since to my understanding a single SQL update on a DataFrame is a transformation. Is there anything equivalent to doing a batch update to a DataFrame in a single lineage step?
Upvotes: 0
Views: 367
Reputation: 330203
Two important notes:

- DataFrames are immutable and as such cannot be updated. You can only create a new DataFrame.
- Transformations and lineage are RDD-specific. While internally every set of operations on a DataFrame (Dataset) is translated to some DAG and executed using RDDs, there is no trivial correspondence between RDD stages and the methods you apply to a Dataset. Operators can be transparently rearranged, removed, or squashed together. How exactly a query is transformed is not part of the contract, and if you're interested in the details for a given version you should check the execution plan (explain) as well as the DAG of the corresponding RDD.
In general you can expect that individual operations may require anywhere between zero stages (if a particular operation is eliminated by projection pruning or trivial predicates) and two stages (typical for aggregations). Projections are typically scheduled together where possible; aggregation behavior has changed over time.
Finally, some operations may require more than one job, for example to infer a schema or compute statistics.
Upvotes: 3