I'm writing an Apache Beam pipeline in Python for handling big data files. The parameters to this pipeline are managed with argparse and PipelineOptions. I created a verbosity parameter (to control the log level) in _add_argparse_args():
parser.add_argument('-v', '--verbosity', action='count', help='Increase output verbosity', default=0)
This argument lets the user specify -v for INFO logs, -vv for DEBUG logs, and -vvv for ALL logs. When the parameter is not specified, only WARN and ERROR logs are displayed.
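For reference, the mapping from the -v count to a log level can be sketched roughly as below. This is a minimal illustration with plain argparse and logging, not the actual pipeline code: the helper name level_for is hypothetical, and treating "ALL" as logging.NOTSET on the root logger is an assumption.

```python
import argparse
import logging


def level_for(verbosity: int) -> int:
    """Map a -v count to a logging level (hypothetical helper)."""
    # 0 -> WARNING (default), 1 -> INFO, 2 -> DEBUG
    levels = [logging.WARNING, logging.INFO, logging.DEBUG]
    if verbosity < len(levels):
        return levels[verbosity]
    # Assumption: "ALL" means no filtering, i.e. NOTSET on the root logger.
    return logging.NOTSET


parser = argparse.ArgumentParser()
parser.add_argument('-v', '--verbosity', action='count', default=0,
                    help='Increase output verbosity')

# An explicit argv list is parsed here so the sketch runs anywhere.
args = parser.parse_args(['-vv'])
logging.getLogger().setLevel(level_for(args.verbosity))
```

With action='count', each occurrence of -v increments the value, so -vv yields 2 and selects DEBUG in this sketch.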
When I run this code with DirectRunner, everything works and the log level is correctly applied based on the number of -v flags. However, when I run the same code on GCP Dataflow, the pipeline does not start (with or without the -v or --verbosity parameter): Dataflow parses the pipeline, does its internal logic of unzipping and fusing steps, prepares the worker pool, and prepares the first fused step, but does not run it and then waits there. After a few minutes, Dataflow notices that there is no activity and scales down the worker pool, and the job eventually shuts down after 1 hour.
If I remove that parameter definition, the pipeline runs fine on GCP Dataflow.
So I am wondering how merely defining this parameter, without even using it, can interfere with GCP Dataflow.
Update: I investigated the Dataflow logs further and found the following output repeated many times (maybe each time Dataflow starts a worker?). It comes from sdk_worker_main.py:
2022-04-11 10:25:38.993 CEST "Logging handler created."
2022-04-11 10:25:38.996 CEST "semi_persistent_directory: /var/opt/google"
2022-04-11 10:25:39.024 CEST "usage: sdk_worker_main.py [-h] [--runner RUNNER] [--streaming] "
2022-04-11 10:25:39.024 CEST " [--resource_hint RESOURCE_HINTS] "
2022-04-11 10:25:39.024 CEST " [--beam_services BEAM_SERVICES] "
2022-04-11 10:25:39.024 CEST " [--type_check_strictness {ALL_REQUIRED,DEFAULT_TO_ANY}] "
2022-04-11 10:25:39.024 CEST " [--type_check_additional TYPE_CHECK_ADDITIONAL] "
...[lists all possible parameters]...
2022-04-11 10:25:39.025 CEST " [--s3_verify S3_VERIFY] [--s3_disable_ssl] "
2022-04-11 10:25:39.025 CEST " --root_path ROOT_PATH --period PERIOD [-v] "
2022-04-11 10:25:39.025 CEST "sdk_worker_main.py: error: argument -v/--verbosity: ignored explicit argument '0' "
2022-04-11 10:25:39.060 CEST "2022/04/11 08:25:39 Python exited: exit status 2 "
In the above logs, --root_path, --period and -v are the parameters that I created for the pipeline.
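The same error message can be reproduced outside Beam with plain argparse. The sketch below assumes (this is inferred from the log, not confirmed) that the worker re-parses the pipeline options in --name=value form, so the default of 0 arrives as --verbosity=0. The helpers build_parser and try_parse are hypothetical, and prog is set only so the message resembles the log.

```python
import argparse
import contextlib
import io


def build_parser() -> argparse.ArgumentParser:
    # prog is set only to mimic the worker's message; this is not Beam code.
    parser = argparse.ArgumentParser(prog='sdk_worker_main.py')
    parser.add_argument('-v', '--verbosity', action='count', default=0,
                        help='Increase output verbosity')
    return parser


def try_parse(argv):
    """Return (verbosity, exit_code, stderr_text) for the given argv."""
    stderr = io.StringIO()
    try:
        with contextlib.redirect_stderr(stderr):
            args = build_parser().parse_args(argv)
        return args.verbosity, None, ''
    except SystemExit as exc:
        # argparse reports usage errors by exiting with status 2.
        return None, exc.code, stderr.getvalue()


# Repeated flags are what action='count' expects.
ok = try_parse(['-vv'])

# Assumption: this is the form the worker receives. A count action takes
# no value, so argparse rejects the explicit '0'.
failed = try_parse(['--verbosity=0'])
print(failed[2])
```

Here the second call exits with status 2 and an "ignored explicit argument '0'" error, which is consistent with the "Python exited: exit status 2" line in the worker log.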