Reputation: 23
I'm using Spark Structured Streaming to process data from a streaming source with a file sink; the processed data is written to HDFS.
The problem is that the output files are named something like part-00012-8d701427-8289-41d7-9b4d-04c5d882664d-c000.txt, which makes it impossible for me to pick out the files written during the last hour.
Is it possible to customize the output file name to something like timestamp_xxx? Or can I write each batch to a different path?
Upvotes: 2
Views: 3931
Reputation: 2333
I believe this file naming is an internal convention Spark uses to store down the values for each partition. Even if you are using some sort of blob store (sorry, I am a Windows user), you should still be able to load the files back from the output location and work on them again using a DataFrame.
What I am trying to say is that although you don't have much say in the file names, since that is something Spark handles itself, it should not stop you from creating your own workflow where you batch things up by looking inside the files for a timestamp (I am assuming the output file contents have some sort of DateTime column; if they don't, it may be a good idea to add one).
That is how I would proceed with things: make the timestamp part of the file contents, then you can use the actual file contents (read them back into a DataFrame, say) and just use normal DataFrame / map operations on the loaded output data.
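A minimal sketch of that workflow, assuming a streaming DataFrame named df, an existing SparkSession named spark, and hypothetical paths and column name (write_ts):

import org.apache.spark.sql.functions.{current_timestamp, expr}

// Stamp every record with a processing-time column before it is written,
// so the file contents carry the timestamp the file names lack.
// "write_ts" and the paths below are placeholder names for this sketch.
val stamped = df.withColumn("write_ts", current_timestamp())

stamped.writeStream
  .format("parquet")
  .option("path", "/output/data/")
  .option("checkpointLocation", "/output/checkpoints/")
  .start()

// Later, load the output back and filter on the embedded timestamp
// instead of relying on the file names.
val lastHour = spark.read
  .parquet("/output/data/")
  .where(expr("write_ts >= current_timestamp() - INTERVAL 1 HOUR"))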
I kind of roughly talk about this here.
Upvotes: 1
Reputation: 28352
You cannot change the name of the saved files. However, you can change the folder structure they are saved in. Use partitionBy()
to partition the data by the specified columns in the dataset; in this case year, month, day and hour could be of interest:
df.writeStream
  .format("parquet") // can be "orc", "json", "csv", etc.
  .option("path", "/path/to/save/")
  // a checkpoint location is required for file sinks, unless a default
  // is configured via spark.sql.streaming.checkpointLocation
  .option("checkpointLocation", "/path/to/checkpoint/")
  .partitionBy("year", "month", "day", "hour")
  .start()
This will create a folder structure starting from the path
which could look as follows:
year=2018
|
|--> month=06
| |
| |--> day=26
| | |
| | |--> hour=10
| | |--> hour=11
| | |--> ...
| |
| |--> day=27
| | |
| | |--> ...
Of course, other columns could be used to partition the files depending on what is available.
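If the dataset does not already carry year/month/day/hour columns, they can be derived from a timestamp column before writing; a minimal sketch, assuming a hypothetical event_time timestamp column in the stream:

import org.apache.spark.sql.functions.{col, year, month, dayofmonth, hour}

// "event_time" is a placeholder for whatever timestamp column the
// incoming stream carries; derive the partition columns from it
// before calling partitionBy() as shown above.
val partitioned = df
  .withColumn("year", year(col("event_time")))
  .withColumn("month", month(col("event_time")))
  .withColumn("day", dayofmonth(col("event_time")))
  .withColumn("hour", hour(col("event_time")))

A nice side effect of this layout is partition pruning on read: a query such as spark.read.parquet("/path/to/save/").where("year = 2018 AND month = 6 AND day = 26 AND hour = 10") only touches the matching directories, which directly answers the "files from the last hour" problem.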
Upvotes: 9