sargis
sargis

Reputation: 1

Missing _metadata_uuid and _metadata_lsn in BigQuery Dataset for Datastream Pipeline

I could really use some assistance!

I've set up a pipeline to copy data from my managed SQL (PostgreSQL) on GCP to BigQuery.

I followed these guides:

Google Cloud Datastream Documentation

Above guide mentions that PG to BQ is supported. I successfully created a Dataflow template as outlined here: Datastream to BigQuery Template

The pipeline requires two BigQuery datasets: the main dataset and a staging dataset. After getting the pipeline running, I started seeing data being ingested into the staging dataset. The pipeline auto-created tables and columns and added the metadata* fields, but it did not add the _metadata_uuid and _metadata_lsn fields, which is causing the merge step to fail.

I checked the data in the GCS bucket that holds the delta files, and I can see the LSN field present in the metadata JSON.

Does anyone know why the _metadata_uuid and _metadata_lsn fields were not auto-created in the BigQuery datasets? Any help would be greatly appreciated!

Thanks in advance!

I was able to get merge step to work by manually adding _metadata_uuid and _metadata_lsn but this feels wrong. Especially if I have dozens of tables.

The JSON payloads in bucket have the following structure that contains LSN. This makes me thing that the actual dataflow (which I used of the template) is not correctly setup.

{
    "uuid": "ffaa95d1-71d2-4574-97f6-c65e00000101",
    "read_timestamp": "2024-11-20T01:23:48.963000Z",
    "source_timestamp": "2024-11-20T01:23:47.615000Z",
    "object": "public_appointments",
    "read_method": "postgresql-cdc",
    "stream_name": "projects/[HIDDEN_FOR_STACKOVER]/locations/us-west1/streams/postgres-stream",
    "schema_key": "68c2081e17f1eed002116c41312188c642bd003b",
    "sort_keys":
    [
        1732065827615,
        "C4/390118C0"
    ],
    "source_metadata":
    {
        "schema": "public",
        "table": "appointments",
        "is_deleted": false,
        "change_type": "UPDATE",
        "tx_id": 3088657,
        "lsn": "C4/390118C0",
        "primary_keys":
        [
            "id",
            "tenant_id"
        ]
    },
    "payload":
    {
        (actual data)
    }
}

The _metadata fields that were auto created are:


  "_metadata_timestamp",
  "_metadata_read_timestamp",
  "_metadata_read_method",
  "_metadata_source_type",
  "_metadata_deleted",
  "_metadata_change_type",
  "_metadata_row_id",
  "_metadata_scn",   
  "_metadata_ssn",
  "_metadata_rs_id"

And we can see that dataflow job created _metadata_scn but this is an Oracle field!

Upvotes: 0

Views: 24

Answers (0)

Related Questions