scoder

Reputation: 2611

spark partition data writing by timestamp

I have some data with a timestamp column that is a long holding a standard epoch timestamp. I need to save that data in a partitioned layout like yyyy/mm/dd/hh using Spark with Scala.

data.write.partitionBy("timestamp").format("orc").save("mypath") 

This just splits the data by the raw timestamp value, like below:

timestamp=1458444061098
timestamp=1458444061198

but I want it to be like:

└── YYYY
    └── MM
        └── DD
            └── HH

Upvotes: 21

Views: 31608

Answers (2)

Constantine

Reputation: 1416

You can leverage the Spark SQL date/time functions for this. First, add a new date/time column derived from the unix timestamp column (note that from_unixtime expects seconds, while the question's timestamps are in milliseconds).

import org.apache.spark.sql.functions._

// epoch is in millis, so divide by 1000; the default from_unixtime format is parseable by year/month/etc.
val withDateCol = data
  .withColumn("date_col", from_unixtime(col("timestamp") / 1000))

After this, you can add year, month, day and hour columns to the DataFrame and then partition the write by these new columns.

withDateCol
  .withColumn("year", year(col("date_col")))
  .withColumn("month", month(col("date_col")))
  .withColumn("day", dayofmonth(col("date_col")))
  .withColumn("hour", hour(col("date_col")))
  .drop("date_col")
  .write // partitionBy belongs to the DataFrameWriter, so .write comes first
  .partitionBy("year", "month", "day", "hour")
  .format("orc")
  .save("mypath")

The columns included in the partitionBy clause won't be part of the file schema; they are encoded in the directory structure instead.
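For example, when you read the output back, the partition columns are reconstructed from the directory names (year=.../month=.../...) rather than from the ORC files themselves. A minimal sketch, using the path from the snippet above:

val readBack = spark.read.orc("mypath")
readBack.printSchema() // year, month, day and hour reappear, inferred from the directory names
readBack.where("year = 2016 AND month = 3").show()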

Upvotes: 28

Silvio

Reputation: 4207

First, I would caution you against over-partitioning. That is, make sure you have sufficient data to make it worth partitioning by hour, otherwise you could end up with lots of partition folders containing only small files. The second caution is against using a deep partition hierarchy (year/month/day/hour), since it requires recursive partition discovery when reading the data back.

Having said that, if you definitely want to partition by hour segments, I would suggest truncating your timestamp to the hour into a new column and partitioning by that. Spark will then be smart enough to recognize the format as a timestamp when you read it back, and you can still perform full filtering as needed.

import org.apache.spark.sql.functions._
import spark.implicits._

input
  // 'timestamp must already be a timestamp type, e.g. ('timestamp / 1000).cast("timestamp")
  .withColumn("ts_trunc", date_trunc("HOUR", 'timestamp)) // date_trunc added in Spark 2.3.0
  .write
  .partitionBy("ts_trunc")
  .save("/mnt/warehouse/part-test")

spark.read.load("/mnt/warehouse/part-test").where("hour(ts_trunc) = 10")

The other option would be to partition by date and hour of day, like so:

input
  // again assuming 'timestamp is a timestamp type, not the raw epoch-millis long
  .withColumn("date", to_date('timestamp))
  .withColumn("hour", hour('timestamp))
  .write
  .partitionBy("date", "hour")
  .save("/mnt/warehouse/part-test")
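As with the first option, you can then filter on the partition columns directly when reading back. A minimal sketch (the filter values are just illustrative):

// Only the matching date=/hour= directories are scanned, since both are partition columns
spark.read
  .load("/mnt/warehouse/part-test")
  .where("date = '2016-03-20' AND hour = 3")
  .show()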

Upvotes: 17
