Reputation: 105
I have many questions about this situation, so here goes:
Has anyone ever written Kafka's output to a Google Cloud Storage (GCS) bucket, such that the data in that bucket is partitioned using the "default hive partitioning layout"? The intent is that this external table needs to be queryable in BigQuery. Google's documentation on that is here (https://cloud.google.com/bigquery/docs/hive-partitioned-queries-gcs), but I wanted to see if someone has an example.
For example, the documentation says "files follow the default layout, with the key/value pairs laid out as directories with an = sign as a separator, and the partition keys are always in the same order."
What's not clear is: a) does Kafka create these directories on the fly, or do I have to pre-create them? Let's say I want to have Kafka write to directories in GCS based on date, e.g.
gs://bucket/table/dt=2020-04-07/
Tonight, after midnight, do I have to pre-create this new directory gs://bucket/table/dt=2020-04-08/, or can Kafka create it for me? And in all this, how does the hive partitioning layout help me?
b) Does my table's data, which I am trying to put in these directories every day, need to have dt (from gs://bucket/table/dt=2020-04-07/) as a column in it?
The goal in all this is to have BigQuery query this external table, which underneath references all the data in this bucket, i.e.
gs://bucket/table/dt=2020-04-06/
gs://bucket/table/dt=2020-04-07/
gs://bucket/table/dt=2020-04-08/
Just trying to see if this would be the right approach for it.
Upvotes: 1
Views: 2404
Reputation: 2099
Kafka itself is a messaging system that allows you to exchange data between processes, applications, and servers, but it requires producers and consumers (here is an example) that move the data. For instance:
The Producer needs to send the data in a format that BigQuery can read.
And the Consumer needs to write the data with a valid Hive Layout.
The Consumer should write to GCS, so you would need to find the proper connector for your application (e.g. this Java connector or the Confluent connector). When writing the messages to GCS, you need to take care to use a valid 'default hive partitioning layout'. For example, in gs://bucket/table/dt=2020-04-07/, dt is the column the table is partitioned on and 2020-04-07 is one of its values, so take care about this; a connector-config sketch is shown below.
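To make that concrete, here is a minimal sketch, assuming the Confluent GCS Sink Connector is installed on a Kafka Connect worker reachable at localhost:8083; the connector name, topic, bucket, and the exact property set are assumptions you should verify against the connector version you run. The point it illustrates is that a time-based partitioner with a path.format of 'dt'=YYYY-MM-dd makes the connector write objects under dt=... prefixes on its own, so there is nothing to pre-create (GCS "directories" are just object-name prefixes).

```python
# Sketch: register a GCS sink connector that writes a hive-style dt=YYYY-MM-dd layout.
# Assumes a Kafka Connect worker on localhost:8083 with the Confluent GCS sink plugin
# installed; topic, bucket, and property values are placeholders.
import requests

connector = {
    "name": "gcs-sink-hive-layout",
    "config": {
        "connector.class": "io.confluent.connect.gcs.GcsSinkConnector",
        "topics": "my-topic",
        "gcs.bucket.name": "bucket",
        "topics.dir": "table",  # objects land under gs://bucket/table/...
        "storage.class": "io.confluent.connect.gcs.storage.GcsStorage",
        "format.class": "io.confluent.connect.gcs.format.json.JsonFormat",
        "flush.size": "1000",
        # The time-based partitioner creates the dt=... prefixes on the fly,
        # one per day, derived from the record timestamp (not from a column).
        "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
        "path.format": "'dt'=YYYY-MM-dd",
        "partition.duration.ms": "86400000",
        "locale": "en-US",
        "timezone": "UTC",
        "timestamp.extractor": "Record",
        # GCS credentials settings are omitted here.
    },
}

resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()
print(resp.json())
```

With a partitioner like this, the dt value comes from the record (or wall-clock) timestamp rather than from a column in your data; if you would rather derive it from a message field, the storage connectors also ship a FieldPartitioner.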
Once you have a valid Hive layout in GCS, you need to create a table in BigQuery. I recommend a native table created from the UI, selecting Google Cloud Storage as the source and enabling 'Source data partitioning', but you can also use --hive_partitioning_source_uri_prefix and --hive_partitioning_mode to link the GCS data with a BigQuery table.
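Since the question targets an external table, here is a minimal sketch of that variant using the google-cloud-bigquery Python client, which is the programmatic equivalent of the bq flags above; the project, dataset, table, bucket, and file format (newline-delimited JSON) are assumptions to adapt.

```python
# Sketch: define a BigQuery external table over the hive-partitioned GCS data.
# Programmatic equivalent of --hive_partitioning_mode and
# --hive_partitioning_source_uri_prefix; names and file format are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

external_config = bigquery.ExternalConfig("NEWLINE_DELIMITED_JSON")
external_config.source_uris = ["gs://bucket/table/*"]
external_config.autodetect = True

hive_opts = bigquery.HivePartitioningOptions()
hive_opts.mode = "AUTO"  # infer the dt key and its type from the dt=... prefixes
hive_opts.source_uri_prefix = "gs://bucket/table"
external_config.hive_partitioning = hive_opts

# dt does not have to be a column inside the files; BigQuery derives it
# from the directory names and exposes it as a queryable column.
table = bigquery.Table("my-project.my_dataset.kafka_events")
table.external_data_configuration = external_config
client.create_table(table)

# Query it like any other table; filtering on dt prunes partitions.
sql = "SELECT COUNT(*) FROM `my-project.my_dataset.kafka_events` WHERE dt = '2020-04-07'"
print(list(client.query(sql).result()))
```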
As all this implies different layers of development and configuration, if this approach makes sense for you, I recommend opening new questions for any specific errors you run into.
Last but not least, the Kafka to BigQuery connector and other connectors that ingest from Kafka into GCP can be a better fit if the Hive layout is not mandatory for your use case.
Upvotes: 1