Omkar

Reputation: 2423

Spark write to parquet based on alphabetical partitioning

I have researched this topic a lot. I have a dataset that is 3 TB in size. Following is the schema for the table:

root
 |-- user: string (nullable = true)
 |-- attributes: array (nullable = true)
 |    |-- element: string (containsNull = true)

Each day, I get a list of users for which I need the attributes. I wanted to know whether I could write the above dataset to Parquet, partitioned by the first two letters of the user name. For example,

Omkar | [a,b,c,d,e]
Mac   | [a,b,c,d,e]
Zee   | [a,b,c,d,e]
Kim   | [a,b,c,d,e]
Kelly | [a,b,c,d,e]

On the above dataset, can I do something like this:

spark.write.mode("overwrite").partitionBy("user".substr(0,2)).parquet("path/to/location")

Doing this, I feel that much less data would be loaded into memory when joining against the users the next time, since we would hit only the relevant partitions.
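To make the idea concrete, this is roughly the daily lookup I have in mind, assuming the data was written with a prefix partition column (all column and path names here are just placeholders):

import org.apache.spark.sql.functions.substring
import spark.implicits._

// Today's user list, with the same two-letter prefix derived from the user name
val usersToday = Seq("Omkar", "Kim").toDF("user")
  .withColumn("prefix", substring($"user", 1, 2))

// Collect the distinct prefixes and filter on the partition column,
// so Spark only scans the matching prefix=... directories
val prefixes = usersToday.select($"prefix").as[String].distinct().collect()

val attributes = spark.read.parquet("path/to/location")
  .where($"prefix".isin(prefixes: _*))

// Join only against the pruned partitions
val result = attributes.join(usersToday, Seq("prefix", "user"))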

Has anyone implemented something like this? Any comments would be appreciated.

Thanks!!

Upvotes: 2

Views: 520

Answers (1)

user9807096

You can. Just replace your code with:

import spark.implicits._  // needed for the $"..." column syntax

df
  .withColumn("prefix", $"user".substr(0, 2)) // add a column with the first two characters of user
  .write.mode("overwrite")
  .partitionBy("prefix")                      // use it for partitioning
  .parquet("path/to/location")

Upvotes: 1
