Reputation: 2423
I have done a lot of research on this topic. I have a dataset that is 3 TB in size. The following is the schema of the table:
root
|-- user: string (nullable = true)
|-- attributes: array (nullable = true)
| |-- element: string (containsNull = true)
Each day, I get a list of users for which I need the attributes. I wanted to know if I could write the above data to Parquet partitioned by the first 2 letters of the user name. For example,
Omkar | [a,b,c,d,e]
Mac | [a,b,c,d,e]
Zee | [a,b,c,d,e]
Kim | [a,b,c,d,e]
Kelly | [a,b,c,d,e]
On the above dataset, can I do something like this:
spark.write.mode("overwrite").partitionBy("user".substr(0,2)).parquet("path/to/location")
My thinking is that, by doing this, far less data would be loaded into memory when joining against the users the next time, since we would hit only the matching partitions.
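For illustration, this is roughly what I have in mind for the daily read (just a sketch; it assumes the partition column ends up being called prefix):
import spark.implicits._  // for the $ column syntax

// read only the partition directories for the prefixes in today's list, e.g. "Om" and "Ma"
val subset = spark.read.parquet("path/to/location")
  .where($"prefix".isin("Om", "Ma"))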
Has anyone implemented something like this? Any comments?
Thanks!!
Upvotes: 2
Views: 520
Reputation:
You can. Just replace your code with:
import spark.implicits._  // needed for the $ column syntax

df
  .withColumn("prefix", $"user".substr(0, 2)) // add a prefix column with the first 2 characters of user
  .write.mode("overwrite")
  .partitionBy("prefix")                      // use it for partitioning
  .parquet("path/to/location")
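When you read the data back for the daily join, derive the same prefix from that day's user list and filter on it before joining, so only the matching partition directories are scanned. A rough sketch, assuming usersToday is a DataFrame holding the day's users in a user column:
val prefixes = usersToday
  .select($"user".substr(0, 2).as("prefix"))
  .distinct()
  .as[String]
  .collect()

val result = spark.read.parquet("path/to/location")
  .where($"prefix".isin(prefixes: _*)) // partition pruning: only these directories are read
  .join(usersToday, Seq("user"))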
Upvotes: 1