Jacek Laskowski

Reputation: 74629

How to use Google Cloud Storage as a storage layer of Delta Lake?

Can I use Google Cloud Storage as a storage layer of Delta Lake?


Found on Slack.

Upvotes: 4

Views: 3270

Answers (2)

Oliver Brylle Majaba

Reputation: 41

It's possible. Here's sample code and the libraries you need.

Make sure to set your credentials first, either in the code or as an environment variable:

export GOOGLE_APPLICATION_CREDENTIALS={gcs-key-path.json}
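If you prefer the in-code route instead of the environment variable, the GCS connector also accepts a service-account keyfile path through Hadoop configuration. A minimal sketch (the key path is a placeholder matching the env-var example above):

```scala
// Sketch: pointing the GCS connector at a service-account key from code
// instead of GOOGLE_APPLICATION_CREDENTIALS. Path is a placeholder.
spark.conf.set("spark.hadoop.google.cloud.auth.service.account.enable", "true")
spark.conf.set("spark.hadoop.google.cloud.auth.service.account.json.keyfile", "gcs-key-path.json")
```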
import org.apache.spark.sql.SparkSession
// The repackaged BigQuery imports below are only needed if you also read/write BigQuery:
import com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.{BigQueryException, BigQueryOptions, DatasetInfo}

spark.conf.set("parentProject", {Proj})
spark.conf.set("spark.hadoop.fs.gs.auth.service.account.enable", "true")   
spark.conf.set("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
spark.conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
spark.conf.set("spark.delta.logStore.gs.impl", "io.delta.storage.GCSLogStore")
spark.conf.set("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
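Note that LogStore and catalog settings are most reliably applied when the session is created rather than on an existing session. A sketch of the same configuration set on the builder (app name is a placeholder; `spark.sql.extensions` is the Delta extension usually paired with `DeltaCatalog`):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: applying the GCS and Delta settings at session-build time,
// so the LogStore configuration is in place before any Delta commit.
val spark = SparkSession.builder()
  .appName("delta-on-gcs") // placeholder
  .config("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
  .config("spark.hadoop.fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
  .config("spark.delta.logStore.gs.impl", "io.delta.storage.GCSLogStore")
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()
```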


val targetTablePath = "gs://{bucket}/{dataset}/{tablename}"
spark.range(5, 10).write.format("delta")
      .mode("overwrite")
      .save(targetTablePath)
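Once written, the table can be read back the same way. A sketch, assuming the same session and `targetTablePath` as above (requires a live GCS bucket, so no output is shown):

```scala
// Sketch: reading the Delta table back from GCS with the same path.
val df = spark.read.format("delta").load(targetTablePath)
df.show()
```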

Libraries that you need (sbt dependencies):

"io.delta" % "delta-core_2.12" % "1.0.0",
"io.delta" % "delta-contribs_2.12" % "1.0.0",
"com.google.cloud.spark" % "spark-bigquery-with-dependencies_2.12" % "0.21.1",
"com.google.cloud.bigdataoss" % "gcs-connector" % "1.9.4-hadoop3"

Checking the Delta files in GCS:

$ gsutil ls gs://r-dps-datapipeline-dev/testoliver/oliver_sample_delta3
gs://r-dps-datapipeline-dev/testoliver/oliver_sample_delta3/
gs://r-dps-datapipeline-dev/testoliver/oliver_sample_delta3/part-00000-ce79bfc7-e28f-4929-955c-56a7a08caf9f-c000.snappy.parquet
gs://r-dps-datapipeline-dev/testoliver/oliver_sample_delta3/part-00001-dda0bd2d-a081-4444-8983-ac8f3a2ffe9d-c000.snappy.parquet
gs://r-dps-datapipeline-dev/testoliver/oliver_sample_delta3/part-00002-93f7429b-777a-42f4-b2dd-adc9a482a6e8-c000.snappy.parquet
gs://r-dps-datapipeline-dev/testoliver/oliver_sample_delta3/part-00003-e9874baf-6c0b-46de-891e-032ac8b67287-c000.snappy.parquet
gs://r-dps-datapipeline-dev/testoliver/oliver_sample_delta3/part-00004-ede54816-2da1-412f-a9e3-5233e77258fb-c000.snappy.parquet
gs://r-dps-datapipeline-dev/testoliver/oliver_sample_delta3/_delta_log/
gs://r-dps-datapipeline-dev/testoliver/oliver_sample_delta3/_symlink_format_manifest/

Upvotes: 3

Jacek Laskowski

Reputation: 74629

It is not possible in Delta Lake up to and including 0.5.0.

There's an issue to track this at https://github.com/delta-io/delta/issues/294. Feel free to upvote that to help get it prioritized.


Update, just a day later: Google posted Getting started with new table formats on Dataproc:

We’re announcing that table format projects Delta Lake and Apache Iceberg (Incubating) are now available in the latest version of Cloud Dataproc (version 1.5 Preview). You can start using them today with either Spark or Presto. Apache Hudi is also available on Dataproc 1.3.

Upvotes: 3
