TomNash

Reputation: 3288

Map a write operation over groups of DataFrame rows to different Delta tables

I have a DataFrame whose rows will be saved to different target tables. Right now, I find the unique combinations of parameters that determine the target table, then iterate over them, filtering the DataFrame and writing each subset.

Something similar to this:

df = spark.read.json(directory).repartition('client', 'region')

unique_clients_regions = [(group.client, group.region) for group in df.select('client', 'region').distinct().collect()]

for client, region in unique_clients_regions:
  (df
   .filter(f"client = '{client}' and region = '{region}'")
   .select(
     ...
   )
   .write.mode("append")
   .saveAsTable(f"{client}_{region}_data") 
  )

Is there a way to map the write operation to the different groupBy groups instead of having to iterate over the distinct set? I made sure to repartition by client and region to try to speed up the filters.

Upvotes: 0

Views: 328

Answers (1)

Steven

Reputation: 15293

I cannot, in good conscience, advise anything based on this solution. That is really bad data architecture.
You should have only one table, partitioned by client and region. That will create a separate folder for each client/region pair, and in the end you need only one write, with no loop and no collect.

# Single write; Spark lays the data out in data/client=<c>/region=<r>/ folders.
spark.read.json(directory).write.saveAsTable(
    "data",
    mode="append",
    partitionBy=['client', 'region']
)
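
For completeness, here is a minimal sketch of how the partitioned layout pays off at read time. The Delta format is an assumption taken from the question's title (drop format("delta") to use the session default), and the client/region values are placeholders:

# Assumed Delta variant of the write above; partitionBy gives the same
# data/client=<c>/region=<r>/ folder layout.
(spark.read.json(directory)
    .write
    .format("delta")
    .mode("append")
    .partitionBy("client", "region")
    .saveAsTable("data"))

# Reading one client/region pair back scans only the matching partition
# folders; 'acme' and 'us-east' are hypothetical values.
subset = spark.table("data").filter("client = 'acme' and region = 'us-east'")

Because the filter is on the partition columns, Spark prunes to the matching folders instead of scanning the whole table, which replaces the per-table filtering loop from the question.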

Upvotes: 1
