Reputation: 773
I am following this tutorial on migrating data from an Oracle database to a Cloud SQL PostgreSQL instance.
I am using the Google-provided streaming template Datastream to PostgreSQL.
At a high level, this is what is expected: when the Avro files are uploaded to the Cloud Storage location, the job is triggered and the data should be written to the target PostgreSQL database.
The job is indeed triggered, but when I check the target PostgreSQL database, the required data has not been populated.
When I check the job logs and worker logs, there are no error logs. When the job is triggered, these are the only logs that appear:
StartBundle: 4
Matched 1 files for pattern gs://BUCKETNAME/ora2pg/DEMOAPP_DEMOTABLE/2022/01/11/20/03/7e13ac05aa3921875434e51c0c0c63aaabced31a_oracle-backfill_336860711_1_0.avro
FinishBundle: 5
Does anyone know what the issue is? Is it a configuration issue? If needed, I will post the required configurations.
If not, could someone advise me on how to properly debug this particular Dataflow job? Thanks
EDIT 1:
When checking the step info for the steps in the pipeline, I found the following:
Below are all the steps in the pipeline:
The first step (DatastreamIO) seems to work as expected, with the correct number of elements (2) in the "Output collection".
However, in the second step, these 2 elements are not found in the Output collection. On further inspection, it can be seen that the elements seem to be dropped in the following step (Format to Postgres DML > Format to Postgres DML > Map):
EDIT 2:
This is a screenshot of the Cloud Worker logs for the above step:
EDIT 3:
I built and deployed the template from source in order to debug this issue. I found that the code works up to the following line in DatabaseMigrationUtils.java:
return KV.of(jsonString, dmlInfo);
Here the jsonString variable contains the dataset read from the .avro file.
However, the code does not progress beyond this line and seems to stop abruptly without any errors being thrown.
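For anyone debugging something similar, below is a minimal sketch of the kind of instrumentation that can make silently dropped elements visible in the worker logs and step metrics. The DoFn shape, counter names, and the convertJsonToDmlInfo stub are placeholders made up for illustration, not the template's actual code; DmlInfo is the template's own value class, so the import path may need adjusting to match your checkout.

import com.google.cloud.teleport.v2.values.DmlInfo; // the template's value class; import path assumed
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical DoFn illustrating the instrumentation; the real transform in
// DatabaseMigrationUtils.java has a different shape.
public class FormatToDmlFn extends DoFn<String, KV<String, DmlInfo>> {

  private static final Logger LOG = LoggerFactory.getLogger(FormatToDmlFn.class);

  // Counters appear under the step's metrics in the Dataflow UI.
  private final Counter emitted = Metrics.counter(FormatToDmlFn.class, "dml-emitted");
  private final Counter dropped = Metrics.counter(FormatToDmlFn.class, "dml-dropped");

  @ProcessElement
  public void processElement(ProcessContext context) {
    String jsonString = context.element();
    try {
      DmlInfo dmlInfo = convertJsonToDmlInfo(jsonString); // placeholder for the template's conversion logic
      if (dmlInfo == null) {
        // Elements dropped without an exception are exactly what the element counters suggested.
        dropped.inc();
        LOG.warn("No DML generated for record: {}", jsonString);
        return;
      }
      emitted.inc();
      context.output(KV.of(jsonString, dmlInfo));
    } catch (Exception e) {
      dropped.inc();
      LOG.error("Failed to convert record to DML: {}", jsonString, e);
    }
  }

  private DmlInfo convertJsonToDmlInfo(String jsonString) {
    // Stub only; replace with the template's real conversion logic.
    return null;
  }
}

With counters like these, the "dml-emitted" and "dml-dropped" values show up in the step metrics in the Dataflow console, which makes it clear whether elements are being filtered out or failing silently.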
Upvotes: 6
Views: 612
Reputation: 773
This answer is accurate as of 19th January 2022.
Upon manually debugging this Dataflow job, I found that the issue is that the job looks for a schema with the exact same name as the value passed for the databaseName parameter, and there is no other input parameter through which a schema name can be passed. Therefore, for this job to work, the tables have to be created/imported into a schema with the same name as the database.
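For illustration, here is a minimal JDBC sketch of preparing the target database this way. The connection details, the database/schema name (demodb), and the table name (demotable) are assumed examples; replace them with the value you pass as databaseName and your own tables.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Requires the PostgreSQL JDBC driver (org.postgresql:postgresql) on the classpath.
public class CreateTargetSchema {
  public static void main(String[] args) throws Exception {
    // Hypothetical connection details for the Cloud SQL PostgreSQL instance.
    String jdbcUrl = "jdbc:postgresql://10.0.0.5:5432/demodb";

    try (Connection conn = DriverManager.getConnection(jdbcUrl, "postgres", "change-me");
         Statement stmt = conn.createStatement()) {
      // The schema name must match the value passed to the template's databaseName parameter.
      stmt.execute("CREATE SCHEMA IF NOT EXISTS demodb");
      // Move an existing table into that schema so the Dataflow job can find it.
      stmt.execute("ALTER TABLE public.demotable SET SCHEMA demodb");
    }
  }
}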
However, as @Iñigo González said, this Dataflow template is currently in beta and seems to have some bugs: as soon as this was resolved, I ran into another issue that required me to change the source code of the template itself and build a custom Docker image for it.
Upvotes: 1