Reputation: 337
I am currently building a raw log data aggregator using Spark Structured Streaming.
The input stream is created from a directory of text files:
// == Input == //
val logsDF = spark.readStream
.format("text")
.option("maxFilesPerTrigger", 1)
.load("input/*")
Logs are then parsed ...
// == Parsing == //
val logsDF2 = ...
... and aggregated
// == Aggregation == //
val windowedCounts = logsDF2
.withWatermark("window_start", "15 minutes")
.groupBy(
col("window"),
col("node")
).count()
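A typical way to build this kind of windowed count, so that the watermark column and the grouping key line up, is to apply the watermark to the event-time column and derive the window from it. A minimal sketch only; the "timestamp" column name is an assumption, since the parsing step is omitted above:
import org.apache.spark.sql.functions.{col, window}
// Hedged sketch: "timestamp" is an assumed event-time column produced by the parsing step
val windowedCounts = logsDF2
.withWatermark("timestamp", "15 minutes")
.groupBy(
window(col("timestamp"), "15 minutes"),
col("node")
).count()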
Everything works fine when I use the "console" sink: the results are updated batch by batch in the console:
// == Output == //
val query = windowedCounts.writeStream
.format("console")
.outputMode("complete")
.start()
.awaitTermination()
Now I want to save my results to a single file (JSON, Parquet, CSV, ...):
// == Output == //
val query = windowedCounts.writeStream
.format("csv")
.option("checkpointLocation", "checkpoint/")
.start("output/")
.awaitTermination()
But it outputs 400 empty CSV files... How can I get my results as I did in the console?
Thank you very much!
Upvotes: 4
Views: 9524
Reputation: 317
This was a long time ago, but I ran into this issue myself and thought I would share the solution. Your code looks fine until you try to sink the data into a CSV file. Try changing the writeStream CSV code to this:
// == Output == //
import org.apache.spark.sql.streaming.Trigger

val query = windowedCounts.writeStream
.format("csv")
.trigger(Trigger.ProcessingTime("10 seconds"))
.option("checkpointLocation", "checkpoint/")
.option("path", "output_path/")
.outputMode("append")
.start()
.awaitTermination()
The line:
.trigger(Trigger.ProcessingTime("10 seconds"))
should take care of your 400 files, since it only writes a new file every 10 seconds. Both of these lines:
.option("path", "output_path/")
.outputMode("append")
should fix the empty-file problem, since you append the latest values and write them into a specific output directory.
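Note that the file sink only supports the append output mode, so outputMode("complete") from the console example cannot be carried over. It also always writes a directory of part files rather than one single file. If you really need a single file, one option is to read the streamed output back in a separate batch job and coalesce it into one partition; a minimal sketch, where the directory names are just examples:
// Batch consolidation sketch: reads the streaming output directory back as a batch DataFrame
val consolidated = spark.read.csv("output_path/")

consolidated
.coalesce(1) // force a single partition, hence a single part file
.write
.mode("overwrite")
.csv("output_single/") // hypothetical directory; it will contain one part file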
Upvotes: 5