Reputation: 1426
I'm looking for a workaround. Sometimes our automated framework will read delta partitions, that does not exist. It will fail because no parquet files are in this partition.
I don't want it to fail.
What I do then is :
spark_read.format('delta').option("basePath",location) \
.load('/mnt/water/green/date=20221209/object=34')
Instead, I want it to return the empty dataframe. Return a dataframe with no records.
I did that, but found it a bit cumbersome, and was wondering if there was a better way.
df = spark_read.format('delta').load(location)
folder_partition = /date=20221209/object=34'.split("/")
for folder_pruning_token in folder_partition :
folder_pruning_token_split = folder_pruning_token.split("=")
column_name = folder_pruning_token_split[0]
column_value = folder_pruning_token_split[1]
df = df .filter(df [column_name] == column_value)
Upvotes: 1
Views: 680
Reputation: 87269
You really don't need to do that trick with Delta Lake tables. This trick was primarily used for Parquet & other file formats to avoid scanning of files on HDFS or cloud storage that is very expensive.
You just need to load data, and filter data using where
/filter
. It's similar to what you do:
df = spark_read.format('delta').load(location) \
.filter("date = '20221209' and object = 34")
If you need, you can of course extract that values automatically, maybe slightly simpler code:
df = spark_read.format('delta').load(location)
folder_partition = '/date=20221209/object=34'.split("/")
cols = [f"{s[0]} = '{s[1]}'"
for s in [f.split('=')for f in folder_partition]
]
df = df.filter(" and ".join(cols))
Upvotes: 2