pascalwhoop

Reputation: 3101

How would one use the Databricks Delta Lake format with Kedro?

We are using Kedro in our project. Normally, one can define datasets like this:

client_table:
  type: spark.SparkDataSet
  filepath: ${base_path_spark}/${env}/client_table
  file_format: parquet
  save_args:
    mode: overwrite

Now we're running on Databricks, which offers many optimisations such as autoOptimizeShuffle. We are considering making use of these to handle our 15 TB+ datasets.
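
For reference, the auto-optimise behaviour we want is exposed as Spark session properties, so in plain PySpark it would look roughly like the sketch below (property names as documented for Databricks auto optimize; where such configuration belongs in a Kedro project is part of what is unclear to us):

from pyspark.sql import SparkSession

# On Databricks this returns the cluster's existing session.
spark = SparkSession.builder.getOrCreate()

# Enable Delta optimised writes and auto compaction for this session.
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")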

However, it's not clear to me how to use Kedro with the Databricks Delta Lake solution.

Upvotes: 1

Views: 738

Answers (2)

datajoely

Reputation: 1516

Kedro now has a native dataset; see the docs here: https://kedro.readthedocs.io/en/stable/tools_integration/pyspark.html#spark-and-delta-lake-interaction

temperature:
  type: spark.SparkDataSet
  filepath: data/01_raw/data.csv
  file_format: "csv"
  load_args:
    header: True
    inferSchema: True
  save_args:
    sep: '|'
    header: True

weather@spark:
  type: spark.SparkDataSet
  filepath: s3a://my_bucket/03_primary/weather
  file_format: "delta"
  save_args:
    mode: "overwrite"
    versionAsOf: 0

weather@delta:
  type: spark.DeltaTableDataSet
  filepath: s3a://my_bucket/03_primary/weather

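The weather@spark / weather@delta pair above is Kedro's transcoding pattern: one node writes the data as a regular Spark DataFrame via the @spark entry, and a downstream node loads the same path as a delta.tables.DeltaTable via the @delta entry so it can run Delta-specific operations such as update or merge. A rough sketch of how the nodes would use those entries (the function, dataset and column names here are just illustrative):

from delta.tables import DeltaTable
from kedro.pipeline import Pipeline, node
from pyspark.sql import DataFrame


def build_weather(raw_weather: DataFrame) -> DataFrame:
    # The returned DataFrame is saved in Delta format through the weather@spark entry.
    return raw_weather.dropDuplicates(["station_id", "observed_at"])


def fix_temperature_unit(weather: DeltaTable) -> None:
    # Loaded through the weather@delta entry, so Delta operations run in place.
    weather.update(
        condition="unit = 'F'",
        set={"temperature": "(temperature - 32) * 5 / 9", "unit": "'C'"},
    )


def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline(
        [
            node(build_weather, inputs="raw_weather", outputs="weather@spark"),
            node(fix_temperature_unit, inputs="weather@delta", outputs=None),
        ]
    )
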
Upvotes: 1

jovib

Reputation: 31

The following worked for us:

client_table:
  type: kedro.contrib.io.pyspark.SparkDataSet
  filepath: ${base_path_spark}/${env}/client_table
  file_format: "delta"
  save_args:
    mode: overwrite

Upvotes: 2
