Reputation: 327
I have an iterative transformation applied to a DataFrame. It used to take a long time, and after a lot of research online it appears the issue was the DAG growing exponentially. To fix this I came across a solution: break the lineage by converting the DataFrame to an RDD and back to a DataFrame after each transformation in the loop. This works wonders when applied to a normal table, but now I'm using DLT and I'm getting this error:
Queries with streaming sources must be executed with writeStream.start();
Is there any way to resolve this?
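For context, here is a minimal sketch of the lineage-breaking loop I mean (the table name and the transformation are illustrative placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# "source_table" and the transformation below are placeholders.
df = spark.read.table("source_table")

for i in range(20):
    df = df.withColumn("value", F.col("value") * 2)  # the iterative step
    # Break the lineage: round-tripping through the RDD truncates the
    # logical plan, so the DAG does not grow with each iteration.
    df = spark.createDataFrame(df.rdd, schema=df.schema)
```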
Upvotes: 0
Views: 86
Reputation: 3250
The error indicates that the foreachBatch operation is not supported in Delta Live Tables (DLT) for streaming queries:
Queries with streaming sources must be executed with writeStream.start();
To work around this limitation, you can take the following approach:
1. Instead of writing directly to the DLT target table within the foreachBatch operation, write the intermediate results to a temporary table.
2. After processing each micro-batch, store the results in this temporary table.
3. Finally, use a separate job or process to periodically merge the data from the temporary table into your target table.
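A rough sketch of this pattern as a regular Structured Streaming job outside the DLT pipeline; staging_table, target_table, the id join key, and the checkpoint path are all assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def stage_batch(batch_df, batch_id):
    # Inside foreachBatch, batch_df is a plain (non-streaming) DataFrame,
    # so lineage-breaking tricks like batch_df.rdd are allowed here.
    batch_df.write.mode("append").saveAsTable("staging_table")

(spark.readStream
      .table("source_table")                # hypothetical streaming source
      .writeStream
      .foreachBatch(stage_batch)
      .option("checkpointLocation", "/tmp/checkpoints/stage")  # illustrative
      .start())

# Run separately (e.g. as a scheduled job): merge staged rows into the
# target table, assuming "id" uniquely identifies a row.
spark.sql("""
    MERGE INTO target_table AS t
    USING staging_table AS s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```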
Reference: DLT fails with Queries with streaming sources must be executed with writeStream.start();
Upvotes: 0