Reputation: 3288
I have a DataFrame whose rows need to be saved to different target tables. Right now, I'm finding the unique combinations of parameters that determine the target table, then iterating over them, filtering the DataFrame, and writing.
Something similar to this:
df = spark.read.json(directory).repartition('client', 'region')

# collect the distinct (client, region) pairs that determine the target tables
unique_clients_regions = [(group.client, group.region) for group in df.select('client', 'region').distinct().collect()]

for client, region in unique_clients_regions:
    (df
     .filter(f"client = '{client}' and region = '{region}'")
     .select(
         ...
     )
     .write.mode("append")
     .saveAsTable(f"{client}_{region}_data")
    )
Is there a way to map the write operation to different groupBy groups instead of having to iterate over the distinct set? I made sure to repartition by client and region to try to speed up the performance of the filter.
Upvotes: 0
Views: 328
Reputation: 15293
I cannot, in good conscience, advise you to do anything with this solution. That is really bad data architecture.
You should have only one table, partitioned by client and region. That will create a separate folder for each client/region pair, and in the end you only need a single write, with no loop and no collect.
# single write, partitioned on disk by client and region
spark.read.json(directory).write.saveAsTable(
    "data",
    mode="append",
    partitionBy=['client', 'region']
)
Upvotes: 1