I want to partitionBy date instead of time in my Spark DataFrame. How can I do that?
Suppose I have a DataFrame whose first column is a datetime, like '2019-06-25 12:00:00', '2019-06-25 11:00:00', etc. I know how to partition it by time, but I don't know how to partition it by date.
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.sparkContext.parallelize([
    Row("2019-06-25 12:00:00", "2"), Row("2019-06-25 11:00:00", "a"),
    Row("2019-06-24 02:03:10", "2"), Row("2019-06-22 08:00:00", "b"),
    Row("2019-03-12 08:01:34", "3")]).toDF(["datetime", "val"])
+-------------------+---+
|           datetime|val|
+-------------------+---+
|2019-06-25 12:00:00|  2|
|2019-06-25 11:00:00|  a|
|2019-06-24 02:03:10|  2|
|2019-06-22 08:00:00|  b|
|2019-03-12 08:01:34|  3|
+-------------------+---+
I want to use something like
df.write.partitionBy(substr('datetime', 10)).save(path='...', mode='...')
to achieve my goal, but apparently the above won't work.
In this kind of situation, you can simply add a new column based on your "datetime" field, say "date_only", and partition on that instead.
The snippet for your code will look like this:
1) Register your DataFrame, whether from a source SQL table, a flat file system, etc. In this case, let's use the sequence below:
df = spark.sparkContext.parallelize([
("2019-06-25 12:00:00", "2"), ("2019-06-25 11:00:00", "a"),
("2019-06-24 02:03:10", "2"), ("2019-06-22 08:00:00", "b"),
("2019-03-12 08:01:34", "3")]).toDF(["datetime", "val"])
2) Prepare a new DataFrame from the source one. partitionBy accepts column names rather than expressions, which is why the substr call from the question fails; the date has to exist as its own column first. And since Spark moves the partition column into the directory names, deriving a separate "date_only" column also means the existing "datetime" column won't be gone from your resultant files in the partitions:
from pyspark.sql import functions as func
partitioned_df = df.withColumn("date_only", func.to_date(func.col("datetime")))
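If you peek at the new column, you can see to_date truncating each timestamp down to its date:
partitioned_df.select("datetime", "date_only").show(2)
# +-------------------+----------+
# |           datetime| date_only|
# +-------------------+----------+
# |2019-06-25 12:00:00|2019-06-25|
# |2019-06-25 11:00:00|2019-06-25|
# +-------------------+----------+
# only showing top 2 rows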
3) Save the data in append mode into those partitions:
partitioned_df.write.partitionBy('date_only').save(path='dbfs:/FileStore/tables/Temp', mode='append')
4) I tried this in Databricks, and the resulting taxonomy is one date_only=YYYY-MM-DD sub-folder per distinct date (screenshot omitted).
5) Each of those sub-folders contains the Parquet (Snappy) files with the remaining columns (screenshot omitted).
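As a quick sanity check (a sketch, assuming the same dbfs:/FileStore/tables/Temp path as above), you can read the output back: Spark rebuilds date_only from the folder names, and filtering on it only scans the matching directory:
# Read the partitioned output back; the date_only=YYYY-MM-DD
# folder names are restored as a regular column.
result = spark.read.parquet('dbfs:/FileStore/tables/Temp')
result.where("date_only = '2019-06-25'").show()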
Kindly let me know if this resolves your problem.