user3858193

Reputation: 1518

Can Google Cloud Data Catalog be used as a metadata repository for Dataproc (Spark/Hive/Presto) and also GCS files?

We are using MySQL (Cloud SQL) as the metadata repository for Dataproc. It doesn't store any information about GCS files that are not part of Hive external tables.

Can anyone suggest the best way to store all the file/data details in one catalog in Google Cloud?

Upvotes: 5

Views: 1233

Answers (2)

mesmacosta

Reputation: 476

dvorzhak,

Data Catalog became GA: Data Catalog GA

The docs have also been updated for filesets: Data Catalog Filesets

If you also want to create Data Catalog entries for each of your Cloud Storage objects, you can use the open source datacatalog-util script, which has an option to create entries for your files.

Finally, there is an open source connector script if you want to ingest Hive databases and tables into Data Catalog.
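To make the fileset option above a bit more concrete, here is a minimal sketch of the request you would send to the Data Catalog REST API (`entries.create`) to register a group of GCS files as a FILESET entry. It only builds the URL and JSON body and does not call the API or handle authentication. The project, location, entry group, and bucket names are hypothetical, and the entry ID check reflects my understanding of the naming rule (letters, digits, underscores, at most 64 characters), so verify it against the current docs.

```python
import json
import re

API_ROOT = "https://datacatalog.googleapis.com/v1"

# Assumption: entry IDs start with a letter or underscore and contain only
# letters, digits, and underscores, up to 64 characters.
ENTRY_ID_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]{0,63}$")


def build_fileset_entry_request(project, location, entry_group, entry_id,
                                display_name, file_patterns):
    """Return (url, body) for creating a FILESET entry over GCS patterns."""
    if not ENTRY_ID_RE.match(entry_id):
        raise ValueError(f"invalid entry id: {entry_id!r}")

    # Fileset patterns must point at Cloud Storage objects.
    bad = [p for p in file_patterns if not p.startswith("gs://")]
    if bad:
        raise ValueError(f"patterns must start with gs://: {bad}")

    parent = (f"projects/{project}/locations/{location}"
              f"/entryGroups/{entry_group}")
    url = f"{API_ROOT}/{parent}/entries?entryId={entry_id}"
    body = {
        "displayName": display_name,
        "type": "FILESET",
        "gcsFilesetSpec": {"filePatterns": list(file_patterns)},
    }
    return url, body


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    url, body = build_fileset_entry_request(
        "my-project", "us-central1", "raw_files", "orders_csv",
        "Orders CSV files", ["gs://my-bucket/orders/*.csv"],
    )
    print("POST", url)
    print(json.dumps(body, indent=2))
```

The entry group (`raw_files` here) has to exist before the entry is created. Once the entry exists, you can tag and search it in Data Catalog alongside your other entries.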

Upvotes: 0

Dagang Wei

Reputation: 26478

The Google Cloud Data Catalog beta doesn't work with GCS or Hive Metastore. See this doc, which says:

Tagging Cloud Storage assets (for example, buckets and objects) is unavailable in the Data Catalog beta release.

It does work with BigQuery, though; see this quickstart example.

Upvotes: 2
