Reputation: 7
Situation:
I read 25M+ rows (~300 columns) from BigQuery and write them to SQL Server via JDBC, and it takes too much time. When I dig in to figure out which step is the most time-inefficient, I come across SDFBoundedSourceReader. The SDFBoundedSourceReader step gets elements one by one, which increases the pipeline's elapsed time, requires a lot of vCPUs, and produces many errors like:
Error message from worker: Error encountered with the status channel: SDK harness sdk-0-0 disconnected.
Operation ongoing in bundle process_bundle-3484514982132990920-35 for at least 12m39s without outputting or completing:
Completed work item 7528337809130607698 UNSUCCESSFULLY: CANCELLED: [type.googleapis.com/util.MessageSetPayload='[dist_proc.dax.internal.StackTraceProto] { stack_top_loc { filepath: .... [dist_proc.dax.workflow.workflow_utils_message_ext]: WORK_PROGRESS_UPDATE_LEASE_ALREADY_CANCELLED }']
I tried:
- pre_optimize=all
- ReadFromBigQuery configurations (see the sketch below):
  - method: EXPORT and DIRECT_READ
  - output_type: BEAM_ROW and PYTHON_DICT
  - passing a schema I created (a NamedTuple)

I'd prefer getting the output as BEAM_ROW, because supplying a schema also takes a lot of time with ~300 columns. However, any ideas to get better performance are welcome. Please check the Dataflow screenshots below.
Upvotes: 0
Views: 33