Reputation: 2423
I have done a lot of research on this topic. I have a dataset that is 3 TB in size. The following is the schema of the table:
root
|-- user: string (nullable = true)
|-- attributes: array (nullable = true)
| |-- element: string (containsNull = true)
Each day, I get a list of users for which I need the attributes. I wanted to know if I could write the above data to Parquet partitioned by the first 2 letters of the user name. For example,
Omkar | [a,b,c,d,e]
Mac | [a,b,c,d,e]
Zee | [a,b,c,d,e]
Kim | [a,b,c,d,e]
Kelly | [a,b,c,d,e]
On the above dataset, can I do something like this:
spark.write.mode("overwrite").partitionBy("user".substr(0,2)).parquet("path/to/location")
My thinking is that, by doing this, far less data would be loaded into memory when joining against the users the next time, since we would hit only the matching partitions.
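For illustration, this is roughly what I have in mind for the daily read (just a sketch; it assumes the partition column ends up being called prefix):
import spark.implicits._  // for the $ column syntax

// read only the partition directories for the prefixes in today's list, e.g. "Om" and "Ma"
val subset = spark.read.parquet("path/to/location")
  .where($"prefix".isin("Om", "Ma"))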
Has anyone implemented something like this? Any comments?
Thanks!!
Upvotes: 2
Views: 520
Reputation:
You can. Just replace your code with:
import spark.implicits._  // needed for the $ column syntax

df
  .withColumn("prefix", $"user".substr(0, 2)) // add a prefix column with the first 2 characters of user
  .write.mode("overwrite")
  .partitionBy("prefix")                      // use it for partitioning
  .parquet("path/to/location")
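When you read the data back for the daily join, derive the same prefix from that day's user list and filter on it before joining, so only the matching partition directories are scanned. A rough sketch, assuming usersToday is a DataFrame holding the day's users in a user column:
val prefixes = usersToday
  .select($"user".substr(0, 2).as("prefix"))
  .distinct()
  .as[String]
  .collect()

val result = spark.read.parquet("path/to/location")
  .where($"prefix".isin(prefixes: _*)) // partition pruning: only these directories are read
  .join(usersToday, Seq("user"))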
Upvotes: 1