Reputation: 3101
We are using Kedro in our project. Normally, one can define datasets like this:
client_table:
  type: spark.SparkDataSet
  filepath: ${base_path_spark}/${env}/client_table
  file_format: parquet
  save_args:
    mode: overwrite
Now we're running on Databricks, which offers many optimisations such as autoOptimizeShuffle. We are considering making use of these to handle our 15TB+ datasets.
However, it's not clear to me how to use Kedro with the Databricks Delta Lake solution.
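From what I can tell, these optimisations are plain Spark session properties, so something like the following in conf/base/spark.yml might enable them (this is only a sketch, assuming the standard Kedro PySpark hook setup that feeds spark.yml into the SparkSession; the property names come from the Databricks docs and should be verified against your runtime version):
# conf/base/spark.yml (sketch; verify property names for your Databricks runtime)
spark.databricks.adaptive.autoOptimizeShuffle.enabled: true
spark.databricks.delta.optimizeWrite.enabled: true
spark.databricks.delta.autoCompact.enabled: true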
Upvotes: 1
Views: 738
Reputation: 1516
Kedro now has a native dataset for this; see the docs here: https://kedro.readthedocs.io/en/stable/tools_integration/pyspark.html#spark-and-delta-lake-interaction
temperature:
  type: spark.SparkDataSet
  filepath: data/01_raw/data.csv
  file_format: "csv"
  load_args:
    header: True
    inferSchema: True
  save_args:
    sep: '|'
    header: True

# Transcoded entries: the same path is exposed both as a Spark DataFrame (@spark)
# and as a DeltaTable object (@delta).
weather@spark:
  type: spark.SparkDataSet
  filepath: s3a://my_bucket/03_primary/weather
  file_format: "delta"
  save_args:
    mode: "overwrite"
    versionAsOf: 0

weather@delta:
  type: spark.DeltaTableDataSet
  filepath: s3a://my_bucket/03_primary/weather
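As a rough sketch of how the two transcoded entries are used in nodes, following the pattern in the linked docs (column names, node names and the update condition here are made up for illustration): the @delta entry hands a node the DeltaTable for in-place operations, while the @spark entry reads the same path back as a regular DataFrame.
from delta.tables import DeltaTable
from kedro.pipeline import Pipeline, node
from pyspark.sql import DataFrame


def cap_outliers(weather: DeltaTable) -> None:
    # In-place Delta update; the node returns nothing, Kedro only tracks the dependency.
    weather.update(condition="temperature > 50", set={"temperature": "50"})


def summarise(weather: DataFrame) -> DataFrame:
    # A downstream node sees the same table as a plain Spark DataFrame.
    return weather.groupBy("station").avg("temperature")


def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        [
            node(cap_outliers, inputs="weather@delta", outputs=None, name="cap_outliers"),
            node(summarise, inputs="weather@spark", outputs="weather_summary", name="summarise"),
        ]
    )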
Upvotes: 1
Reputation: 31
It worked for us.
client_table:
  # Dataset path from older Kedro releases; newer releases register it as spark.SparkDataSet.
  type: kedro.contrib.io.pyspark.SparkDataSet
  filepath: ${base_path_spark}/${env}/client_table
  file_format: "delta"
  save_args:
    mode: overwrite
Upvotes: 2