Reputation: 453
I'm trying to write data pulled from Kafka to a BigQuery table every 120 seconds.
I would like to do some additional operations which, per the documentation, should be possible inside the .foreach() or foreachBatch() method.
As a test, I wanted to print a simple message every time data gets pulled from Kafka and written to BigQuery.
batch_job = df_alarmsFromKafka.writeStream \
    .trigger(processingTime='120 seconds') \
    .foreachBatch(print("do i get printed every batch?")) \
    .format("bigquery").outputMode("append") \
    .option("temporaryGcsBucket", path1) \
    .option("checkpointLocation", path2) \
    .option("table", table_kafka) \
    .start()
batch_job.awaitTermination()
I would expect this message to be printed every 120 seconds in the JupyterLab output cell; instead it gets printed only once, while the query just keeps writing to BigQuery.
If I try to use .foreach() instead of foreachBatch():
batch_job = df_alarmsFromKafka.writeStream \
    .trigger(processingTime='120 seconds') \
    .foreach(print("do i get printed every batch?")) \
    .format("bigquery").outputMode("append") \
    .option("temporaryGcsBucket", path1) \
    .option("checkpointLocation", path2) \
    .option("table", table_kafka) \
    .start()
batch_job.awaitTermination()
it prints the message once and immediately afterwards gives the following error, which I could not debug/understand:
/usr/lib/spark/python/pyspark/sql/streaming.py in foreach(self, f)
1335
1336 if not hasattr(f, 'process'):
-> 1337 raise Exception("Provided object does not have a 'process' method")
1338
1339 if not callable(getattr(f, 'process')):
Exception: Provided object does not have a 'process' method
Am I doing something wrong? How can I simply perform some operations every 120 seconds, other than those applied directly to the evaluated dataframe df_alarmsFromKafka?
Upvotes: 1
Views: 1288
Reputation: 407
Additional operations are allowed, but only on the output data of the streaming query. Here, what you pass to foreachBatch is not related to the output data at all: print("...") is evaluated once, when the query is defined, so the message gets printed only once.
For example, if you write a foreachBatch function like the one below:
def write_to_cassandra(target_df, batch_id):
    target_df.write \
        .format("org.apache.spark.sql.cassandra") \
        .option("keyspace", "tweet_db") \
        .option("table", "tweet2") \
        .mode("append") \
        .save()
    target_df.show()
it will print target_df on every batch, since the .show() call operates on the output data itself.
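Applied to your case, a minimal sketch (assuming the same df_alarmsFromKafka, path1, path2 and table_kafka variables from your question) could look like the following. Note that foreachBatch takes the function itself, not the result of calling it, and that it replaces the sink, so the BigQuery write moves inside the batch function:

def write_to_bigquery(batch_df, batch_id):
    # Write this micro-batch to BigQuery (indirect write via a GCS bucket)
    batch_df.write \
        .format("bigquery") \
        .option("temporaryGcsBucket", path1) \
        .option("table", table_kafka) \
        .mode("append") \
        .save()
    # Runs on the driver once per micro-batch, i.e. every 120 seconds
    print("batch {} written".format(batch_id))

batch_job = df_alarmsFromKafka.writeStream \
    .trigger(processingTime='120 seconds') \
    .foreachBatch(write_to_bigquery) \
    .option("checkpointLocation", path2) \
    .start()
batch_job.awaitTermination()

Since the foreachBatch function runs on the driver, the print should show up in your JupyterLab output cell on every trigger.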
For your second question:
foreach expects you to provide either a function that processes one row at a time, or an object implementing open, process and close methods (the ForeachWriter contract), which you did not do there.
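A minimal sketch of the second form, assuming you only want to see the rows:

class RowPrinter:
    def open(self, partition_id, epoch_id):
        # Return True to process this partition
        return True

    def process(self, row):
        # Called once per row
        print(row)

    def close(self, error):
        pass

query = df_alarmsFromKafka.writeStream \
    .trigger(processingTime='120 seconds') \
    .foreach(RowPrinter()) \
    .start()

Keep in mind that process runs on the executors, so on a cluster the printed rows end up in the executor logs rather than in the notebook cell.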
Upvotes: 0