Reputation: 3101
We are using Kedro in our project. Normally, one can define datasets like this:
client_table:
  type: spark.SparkDataSet
  filepath: ${base_path_spark}/${env}/client_table
  file_format: parquet
  save_args:
    mode: overwrite
Now we're running on Databricks, which offers many optimisations such as autoOptimizeShuffle. We are considering making use of these to handle our 15TB+ datasets.
However, it's not clear to me how to use Kedro with the Databricks Delta Lake solution.
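From what I can tell, these optimisations are plain Spark session properties, so something like the following in conf/base/spark.yml might enable them (this is only a sketch, assuming the standard Kedro PySpark hook setup that feeds spark.yml into the SparkSession; the property names come from the Databricks docs and should be verified against your runtime version):
# conf/base/spark.yml (sketch; verify property names for your Databricks runtime)
spark.databricks.adaptive.autoOptimizeShuffle.enabled: true
spark.databricks.delta.optimizeWrite.enabled: true
spark.databricks.delta.autoCompact.enabled: true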
Upvotes: 1
Views: 738
Reputation: 1516
Kedro now has a native dataset for this; see the docs here: https://kedro.readthedocs.io/en/stable/tools_integration/pyspark.html#spark-and-delta-lake-interaction
temperature:
  type: spark.SparkDataSet
  filepath: data/01_raw/data.csv
  file_format: "csv"
  load_args:
    header: True
    inferSchema: True
  save_args:
    sep: '|'
    header: True

# Transcoded entries: the same path is exposed both as a Spark DataFrame (@spark)
# and as a DeltaTable object (@delta).
weather@spark:
  type: spark.SparkDataSet
  filepath: s3a://my_bucket/03_primary/weather
  file_format: "delta"
  save_args:
    mode: "overwrite"
    versionAsOf: 0

weather@delta:
  type: spark.DeltaTableDataSet
  filepath: s3a://my_bucket/03_primary/weather
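As a rough sketch of how the two transcoded entries are used in nodes, following the pattern in the linked docs (column names, node names and the update condition here are made up for illustration): the @delta entry hands a node the DeltaTable for in-place operations, while the @spark entry reads the same path back as a regular DataFrame.
from delta.tables import DeltaTable
from kedro.pipeline import Pipeline, node
from pyspark.sql import DataFrame


def cap_outliers(weather: DeltaTable) -> None:
    # In-place Delta update; the node returns nothing, Kedro only tracks the dependency.
    weather.update(condition="temperature > 50", set={"temperature": "50"})


def summarise(weather: DataFrame) -> DataFrame:
    # A downstream node sees the same table as a plain Spark DataFrame.
    return weather.groupBy("station").avg("temperature")


def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        [
            node(cap_outliers, inputs="weather@delta", outputs=None, name="cap_outliers"),
            node(summarise, inputs="weather@spark", outputs="weather_summary", name="summarise"),
        ]
    )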
Upvotes: 1
Reputation: 31
It worked for us.
client_table:
  # Dataset path from older Kedro releases; newer releases register it as spark.SparkDataSet.
  type: kedro.contrib.io.pyspark.SparkDataSet
  filepath: ${base_path_spark}/${env}/client_table
  file_format: "delta"
  save_args:
    mode: overwrite
Upvotes: 2