Reputation: 301
I'm pretty new to Spark (2 days) and I'm pondering the best way to partition parquet files.
My rough plan at the moment is to read in the source CSVs and write them back out as Parquet files, partitioned by year/month/day/hour.
It's been ludicrously easy (kudos to the Spark devs) to get a simple version working, except for partitioning the way I'd like to. This is in Python, BTW:
# read the source CSV files with an explicit schema
input = sqlContext.read.format('com.databricks.spark.csv').load(source, schema=myschema)
# write out as Parquet, partitioned by the existing 'type' column
input.write.partitionBy('type').format("parquet").save(dest, mode="append")
Is the best approach to map over the data, adding new columns for year, month, day and hour, and then use partitionBy? Then for any queries we'd have to manually add year/month etc.? Given how elegant I've found Spark to be so far, this seems a little odd.
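For concreteness, this is roughly what I'm imagining (just a sketch, I haven't got this working; the inserted_at timestamp column and the UDF-based derivation are assumptions on my part):

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# sketch only: derive one partition column per timestamp component
year_col  = udf(lambda ts: ts.year,  IntegerType())
month_col = udf(lambda ts: ts.month, IntegerType())
day_col   = udf(lambda ts: ts.day,   IntegerType())
hour_col  = udf(lambda ts: ts.hour,  IntegerType())

with_parts = (input
    .withColumn('year',  year_col(input.inserted_at))
    .withColumn('month', month_col(input.inserted_at))
    .withColumn('day',   day_col(input.inserted_at))
    .withColumn('hour',  hour_col(input.inserted_at)))

with_parts.write.partitionBy('year', 'month', 'day', 'hour').format("parquet").save(dest, mode="append")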
Thanks
Upvotes: 7
Views: 11395
Reputation: 301
I've found a few ways to do this now; I haven't run performance tests over them yet, so caveat emptor:
First we need to create a derived DataFrame (three ways shown below) and then write it out.
1) SQL queries (inline functions)
from pyspark.sql.types import IntegerType

# register a UDF that extracts the day-of-month from a timestamp
sqlContext.registerFunction("day", lambda f: f.day, IntegerType())
input.registerTempTable("input")
input_ts = sqlContext.sql(
    "select day(inserted_at) AS inserted_at_day, * from input")
2) SQL queries (non-inline) - very similar
def day(ts):
    # same extraction as above, as a named function
    return ts.day

sqlContext.registerFunction("day", day, IntegerType())
... rest as before
3) withColumn
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# wrap the extraction in a UDF and add it as a new column
day = udf(lambda f: f.day, IntegerType())
input_ts = input.withColumn('inserted_at_day', day(input.inserted_at))
To write it out, just:
input_ts.write.partitionBy(['inserted_at_day']).format("parquet").save(dest, mode="append")
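For querying afterwards, something like this should work (again just a sketch I haven't benchmarked; it assumes the same dest path, and relies on Spark pruning partitions when you filter on the partition column):

# read the partitioned Parquet back; the partition column is reconstructed from the directory names
parquet_df = sqlContext.read.parquet(dest)
# filtering on the partition column should only touch the matching directories
day_15 = parquet_df.filter(parquet_df.inserted_at_day == 15)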
Upvotes: 4