Reputation: 842
I am able to develop a pipeline that reads from Kafka, does some transformations, and writes the output to a Kafka sink as well as a Parquet sink. I would like to add effective logging to log the intermediate results of the transformation, as in a regular streaming application.
One option I see is to log the queryExecution plan and the streaming query progress
via
df.queryExecution.analyzed.numberedTreeString
or
logger.info("Query progress"+ query.lastProgress)
logger.info("Query status"+ query.status)
But this doesn't seem to offer a way to see the business-specific messages that the stream is processing.
Is there a way I can add more logging info, such as the data it is processing?
Upvotes: 4
Views: 1886
Reputation: 842
I found some options to track this. Basically, we can name our streaming query using
df.writeStream.format("parquet").queryName("table1")
The query name table1 will be printed in the Spark Jobs tab against the Completed Jobs list in the Spark UI, from which you can track the status of each streaming query.
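For example, a minimal sketch (the output path and checkpoint location are just placeholders):

val query = df.writeStream
  .format("parquet")
  .queryName("table1")
  .option("path", "/data/output/table1")
  .option("checkpointLocation", "/data/checkpoints/table1")
  .start()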
We can also use the ProgressReporter API in Structured Streaming to collect more stats.
Upvotes: 2
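As a rough sketch of the ProgressReporter route (assuming a SparkSession named spark and an slf4j logger are already in scope), a StreamingQueryListener receives the same progress objects as query.lastProgress on every trigger:

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryStartedEvent, QueryProgressEvent, QueryTerminatedEvent}

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit =
    logger.info("Query started: " + event.name)

  // event.progress carries per-trigger stats such as numInputRows and processedRowsPerSecond
  override def onQueryProgress(event: QueryProgressEvent): Unit =
    logger.info("Progress for " + event.progress.name + ": " + event.progress.json)

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
    logger.info("Query terminated: " + event.id)
})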