Mehdi LAMRANI
Mehdi LAMRANI

Reputation: 11607

Spark MicroBatchExecution : Streaming query made progress... Really?

I am running delta streaming queries and I keep getting zillions of updates and QueryProgressEvent intercepted by StreamingQueryListener while "Nothing" is really happening - or so it seems.

Why are those event fired if no rows were detected to be processed ? What is considered a "progress" ?

To me this is just log pollution that is uncalled for, and I had to find a way to mute it until something is "really" happening, but I still am curious on the why and how.

20/01/01 23:18:21 INFO MicroBatchExecution: Streaming query made progress: {
  "id" : "bca1d3d2-4196-4e89-9dcf-916536bd00a6",
  "runId" : "2e6bfbef-cea1-48dd-b228-39f7fdc09e27",
  "name" : "STREAM_DELTA",
  "timestamp" : "2020-01-01T23:18:21.950Z",
  "batchId" : 1,
  "numInputRows" : 0,
  "inputRowsPerSecond" : 0.0,
  "processedRowsPerSecond" : 0.0,
  "durationMs" : {
    "getOffset" : 1,
    "triggerExecution" : 1
  },
  "stateOperators" : [ ],
  "sources" : [ {
    "description" : "FileStreamSource[file:/delta/source]",
    "startOffset" : {
      "logOffset" : 0
    },
    "endOffset" : {
      "logOffset" : 0
    },
    "numInputRows" : 0,
    "inputRowsPerSecond" : 0.0,
    "processedRowsPerSecond" : 0.0
  } ],
  "sink" : {
    "description" : "DeltaSink[/delta/output]"
  }
}

Upvotes: 0

Views: 1706

Answers (1)

Zack
Zack

Reputation: 2466

Why are those event fired if no rows were detected to be processed ?

Structured Streaming is not event driven. A structured stream runs either continuously or through micro batching.

  • Continuous: Your stream is running nonstop. As soon as one run ends, the next begins.
  • Microbatching: Your stream runs on an interval based on your trigger rule (say 5 seconds). When one stream run ends, it waits 5 seconds until re-running.

In either case, the stream always checks to see if there any new files in its input location to process. If there are new files, it processes them as configured and writes the file names to its checkpoint so that those files are not re-processed as new. If there are no new files, it finishes the run as it sees that there is no work to. That is why these events are fired even if no rows are detected.

What is considered a "progress" ?

Progress is seen as the conclusion of a successful run, as shown by the logs you posted. The stream made "progress" by running.

Upvotes: 1

Related Questions