user538578964

Reputation: 773

Debugging a Google Dataflow Streaming Job that does not work as expected

I am following this tutorial on migrating data from an Oracle database to a Cloud SQL PostgreSQL instance.

I am using the Google-provided streaming template Datastream to PostgreSQL.

At a high level this is what is expected:

  1. Datastream exports backfill and change data from the source Oracle database, in Avro format, into the specified Cloud Storage bucket location.
  2. This triggers the Dataflow job to pick up the Avro files from this Cloud Storage location and insert them into the PostgreSQL instance (a minimal sketch of this pattern is shown below).
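
For context, here is a minimal Apache Beam (Java) sketch of this read-Avro-from-GCS / write-to-PostgreSQL pattern. It is not the actual template code; the connection details and the demoapp.demotable target table are placeholders.

// Minimal sketch of the Avro-from-GCS -> PostgreSQL pattern (not the actual template code).
// The connection URL, credentials and target table are placeholders.
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.AvroIO;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class AvroToPostgresSketch {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline pipeline = Pipeline.create(options);

    pipeline
        .apply("ReadAvroFiles",
            // Parse each Avro GenericRecord into a simple string payload for illustration.
            AvroIO.<String>parseGenericRecords(record -> record.toString())
                .withCoder(StringUtf8Coder.of())
                .from("gs://BUCKETNAME/ora2pg/DEMOAPP_DEMOTABLE/**"))
        .apply("WriteToPostgres",
            JdbcIO.<String>write()
                .withDataSourceConfiguration(
                    JdbcIO.DataSourceConfiguration.create(
                            "org.postgresql.Driver",
                            "jdbc:postgresql://CLOUD_SQL_HOST:5432/DATABASE_NAME")
                        .withUsername("postgres")
                        .withPassword("PASSWORD"))
                .withStatement("INSERT INTO demoapp.demotable (payload) VALUES (?)")
                .withPreparedStatementSetter(
                    (element, statement) -> statement.setString(1, element)));

    pipeline.run();
  }
}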

When the Avro files are uploaded into the Cloud Storage location, the job is indeed triggered, but when I check the target PostgreSQL database, the required data has not been populated.

When I check the job logs and worker logs, there are no error logs. When the job is triggered, these are the only logs that appear:

StartBundle: 4
Matched 1 files for pattern gs://BUCKETNAME/ora2pg/DEMOAPP_DEMOTABLE/2022/01/11/20/03/7e13ac05aa3921875434e51c0c0c63aaabced31a_oracle-backfill_336860711_1_0.avro
FinishBundle: 5

Does anyone know what the issue is? Is it a configuration issue? If needed I will post the required configurations.

If not, could someone advise me on how to properly debug this particular Dataflow job? Thanks

EDIT 1:

When checking the step info for the steps in the pipeline, I found the following:

Below are all the steps in the pipeline:

The first step (DatastreamIO) seems to work as expected, with the correct number of elements (2) in the "Output collection".

However, in the second step these 2 elements are not found in the Output collection. On further inspection, the elements seem to be dropped in the following step (Format to Postgres DML > Format to Postgres DML > Map).
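
If the template is built from source (as described in EDIT 3 below), one generic way to confirm where elements disappear is to insert a pass-through DoFn between the suspect steps that logs each element and increments a custom counter visible in the Dataflow UI. This is only a sketch; the class and counter names are illustrative, not part of the template.

// Minimal pass-through DoFn for debugging: logs each element and counts it
// with a Beam metric that shows up as a custom counter on the step.
// The class name and counter name are illustrative only.
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class DebugLogFn extends DoFn<String, String> {
  private static final Logger LOG = LoggerFactory.getLogger(DebugLogFn.class);
  private final Counter seen = Metrics.counter(DebugLogFn.class, "elements-seen");

  @ProcessElement
  public void processElement(@Element String element, OutputReceiver<String> out) {
    seen.inc();                        // visible as a custom counter on the step
    LOG.info("Element: {}", element);  // appears in the worker logs
    out.output(element);               // pass the element through unchanged
  }
}

Inserting .apply("Debug", ParDo.of(new DebugLogFn())) before and after the suspect step makes it easy to see at which point the element count drops.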

EDIT 2:

This is a screenshot of the Cloud worker logs for the above step:

EDIT 3:

I built and deployed the template from source myself in order to debug this issue. I found that the code works up to the following line in DatabaseMigrationUtils.java:

return KV.of(jsonString, dmlInfo);

Here the jsonString variable contains the data read from the .avro file.

However, the code does not progress beyond this point and seems to stop abruptly, without any errors being thrown.
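
To make a silent failure like this visible, one option is to wrap the conversion in a try/catch that logs before rethrowing. The following is only a sketch of that approach, not the template's actual code; convertToDml is a hypothetical stand-in for the DatabaseMigrationUtils conversion logic, and the output type is simplified to a String.

// Sketch: defensive logging around the mapping step so that any swallowed
// exception shows up in the worker logs. convertToDml is a hypothetical
// stand-in for the template's real conversion logic.
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class FormatToDmlDebugFn extends DoFn<String, KV<String, String>> {
  private static final Logger LOG = LoggerFactory.getLogger(FormatToDmlDebugFn.class);

  @ProcessElement
  public void processElement(@Element String jsonString, OutputReceiver<KV<String, String>> out) {
    try {
      String dmlInfo = convertToDml(jsonString);
      out.output(KV.of(jsonString, dmlInfo));
    } catch (RuntimeException e) {
      // Without explicit logging, a failure here can make elements vanish
      // from the output collection with nothing useful in the logs.
      LOG.error("Failed to convert record to DML: {}", jsonString, e);
      throw e;
    }
  }

  private String convertToDml(String jsonString) {
    // Hypothetical placeholder for the real conversion.
    return jsonString;
  }
}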

Upvotes: 6

Views: 612

Answers (1)

user538578964

Reputation: 773

This answer is accurate as of 19th January 2022.

Upon manually debugging this Dataflow job, I found that the issue is that the job looks for a schema with the exact same name as the value passed for the databaseName parameter, and there is no other input parameter through which a schema name can be passed. Therefore, for this job to work, the tables have to be created/imported into a schema with the same name as the database.
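
To make the workaround concrete, here is a one-off JDBC sketch that creates a schema whose name matches the databaseName parameter before the job is run. The connection URL, credentials and the name demodb are placeholders, and the same could of course be done directly with psql.

// One-off JDBC sketch of the workaround: ensure a schema exists whose name
// matches the databaseName parameter passed to the Dataflow job.
// The connection URL, credentials and the name "demodb" are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateMatchingSchema {
  public static void main(String[] args) throws Exception {
    String url = "jdbc:postgresql://CLOUD_SQL_HOST:5432/demodb";
    try (Connection conn = DriverManager.getConnection(url, "postgres", "PASSWORD");
         Statement stmt = conn.createStatement()) {
      // The target tables must then be created/imported inside this schema.
      stmt.execute("CREATE SCHEMA IF NOT EXISTS demodb");
    }
  }
}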

However, as @Iñigo González said, this Dataflow template is currently in beta and seems to have some bugs: I ran into another issue as soon as this one was resolved, which required me to change the source code of the Dataflow template job itself and build a custom Docker image for it.

Upvotes: 1
