Reputation: 23
I'm using Spark Structured Streaming to process data from a streaming source with a file sink; the processed data is written to HDFS.
The problem is that the output files are named something like part-00012-8d701427-8289-41d7-9b4d-04c5d882664d-c000.txt, which makes it impossible for me to pick out the files written during the last hour.
Is it possible to customize the output file name to something like timestamp_xxx? Or can I write each batch to a different path?
Upvotes: 2
Views: 3931
Reputation: 2333
I believe this file naming is an internal convention Spark uses to store down the values for each partition. Even if you are using some sort of blob store (sorry, I am a Windows user), you should still be able to load the files back from the output location and work on them again using a DataFrame.
What I am trying to say is that although you don't have much say in the file names, since that is something Spark handles itself, it should not stop you from creating your own workflow where you batch things up by looking inside the files for a timestamp (I am assuming the output file contents have some sort of DateTime column; if they don't, it may be a good idea to add one).
That is how I would proceed with things: make the timestamp part of the file contents, then you can use the actual file contents (read them back into a DataFrame, say) and just use normal DataFrame / map operations on the loaded output data.
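A minimal sketch of that workflow, assuming a streaming DataFrame named df, an existing SparkSession named spark, and hypothetical paths and column name (write_ts):

import org.apache.spark.sql.functions.{current_timestamp, expr}

// Stamp every record with a processing-time column before it is written,
// so the file contents carry the timestamp the file names lack.
// "write_ts" and the paths below are placeholder names for this sketch.
val stamped = df.withColumn("write_ts", current_timestamp())

stamped.writeStream
  .format("parquet")
  .option("path", "/output/data/")
  .option("checkpointLocation", "/output/checkpoints/")
  .start()

// Later, load the output back and filter on the embedded timestamp
// instead of relying on the file names.
val lastHour = spark.read
  .parquet("/output/data/")
  .where(expr("write_ts >= current_timestamp() - INTERVAL 1 HOUR"))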
I kind of roughly talk about this here.
Upvotes: 1
Reputation: 28352
You cannot change the name of the saved files. However, you can change the folder structure they are saved in. Use partitionBy()
to partition the data by the specified columns in the dataset; in this case year, month, day and hour could be of interest:
df.writeStream
  .format("parquet") // can be "orc", "json", "csv", etc.
  .option("path", "/path/to/save/")
  // a checkpoint location is required for file sinks, unless a default
  // is configured via spark.sql.streaming.checkpointLocation
  .option("checkpointLocation", "/path/to/checkpoint/")
  .partitionBy("year", "month", "day", "hour")
  .start()
This will create a folder structure starting from the path
which could look as follows:
year=2018
|
|--> month=06
| |
| |--> day=26
| | |
| | |--> hour=10
| | |--> hour=11
| | |--> ...
| |
| |--> day=27
| | |
| | |--> ...
Of course, other columns could be used to partition the files depending on what is available.
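If the dataset does not already carry year/month/day/hour columns, they can be derived from a timestamp column before writing; a minimal sketch, assuming a hypothetical event_time timestamp column in the stream:

import org.apache.spark.sql.functions.{col, year, month, dayofmonth, hour}

// "event_time" is a placeholder for whatever timestamp column the
// incoming stream carries; derive the partition columns from it
// before calling partitionBy() as shown above.
val partitioned = df
  .withColumn("year", year(col("event_time")))
  .withColumn("month", month(col("event_time")))
  .withColumn("day", dayofmonth(col("event_time")))
  .withColumn("hour", hour(col("event_time")))

A nice side effect of this layout is partition pruning on read: a query such as spark.read.parquet("/path/to/save/").where("year = 2018 AND month = 6 AND day = 26 AND hour = 10") only touches the matching directories, which directly answers the "files from the last hour" problem.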
Upvotes: 9