Jas Bali

Reputation: 241

How does Apache Spark SQL lineage evolve as we run SQL updates on a DataFrame?

I am trying to develop a backend module that requires me to run several SQL updates on a DataFrame backed by the Parquet format in HDFS. What I am interested in knowing is how multiple SQL updates affect the DataFrame's RDD lineage, and also whether it would be a concern to execute several frequent SQL updates on a DataFrame, since according to my understanding a single SQL update on a DataFrame is a transformation. Is there anything equivalent to doing a batch update on a DataFrame in a single lineage step?
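For illustration, here is a minimal sketch of the kind of workflow I mean (the paths and column names are made up):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("sql-updates").getOrCreate()

// Read the Parquet-backed data; each "update" below is a transformation
// that produces a new DataFrame rather than modifying the original.
val df = spark.read.parquet("hdfs:///data/events")

val updated = df
  .withColumn("status",
    when(col("status") === "open", lit("pending")).otherwise(col("status")))
  .withColumn("score", col("score") * 1.1)

// Write to a new location (reading from and overwriting the same
// Parquet path in one job is not supported).
updated.write.mode("overwrite").parquet("hdfs:///data/events_updated")
```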

Upvotes: 0

Views: 367

Answers (1)

zero323

Reputation: 330203

Two important notes:

  • Spark DataFrames are immutable and as such cannot be updated. You can only create a new DataFrame.
  • Transformations and lineage are RDD-specific concepts. While internally every set of operations on a DataFrame (Dataset) is translated to a DAG and executed using RDDs, there is no trivial correspondence between RDD stages and the methods you apply on a Dataset. Operators can be transparently rearranged, removed, or squashed together. How exactly a query is transformed is not part of the contract, and if you're interested in the details for a given version you should check the execution plan as well as the DAG of the corresponding RDD (a sketch follows this list).

    In general you can expect that an individual operation may require between zero stages (if the operation is eliminated by projections or the use of trivial predicates) and two stages (typical for aggregations). Projections are typically scheduled together if possible; aggregation behavior has changed over time.

    Finally, some operations may require more than one job, to infer the schema or compute statistics.
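As a minimal sketch of both points (assuming an existing SparkSession named spark; the path and column names are hypothetical), every operation returns a new DataFrame, and you can inspect the optimized plan as well as the lineage of the underlying RDD:

```scala
import org.apache.spark.sql.functions.col

val df = spark.read.parquet("hdfs:///data/events")  // hypothetical input

// "Updating" a column returns a new, immutable DataFrame; df is unchanged.
val df2 = df.withColumn("score", col("score") + 1)
val df3 = df2.filter(col("score") > 10)

// The optimizer is free to rearrange or collapse these operations, so
// inspect the plans rather than assuming one stage per method call:
df3.explain(true)  // parsed, analyzed, optimized and physical plans

// RDD-level lineage of the corresponding RDD:
println(df3.rdd.toDebugString)
```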

Upvotes: 3
