Redhill

Reputation: 7

How to profit from/configure SDFBoundedSourceReader while reading from BigQuery in Python?

Situation: I read 25M+ rows (~300 columns) from BigQuery and write them to SQL Server via JDBC, and it takes too much time. When I try to figure out which step is the bottleneck, I find SDFBoundedSourceReader. This step emits elements one by one, which inflates the pipeline's elapsed time, requires a lot of vCPUs, and produces many errors like:

Error message from worker: Error encountered with the status channel: SDK harness sdk-0-0 disconnected.

Operation ongoing in bundle process_bundle-3484514982132990920-35 for at least 12m39s without outputting or completing:

Completed work item 7528337809130607698 UNSUCCESSFULLY: CANCELLED: [type.googleapis.com/util.MessageSetPayload='[dist_proc.dax.internal.StackTraceProto] { stack_top_loc { filepath: .... [dist_proc.dax.workflow.workflow_utils_message_ext]: WORK_PROGRESS_UPDATE_LEASE_ALREADY_CANCELLED }']

I tried:

ReadFromBigQuery configurations:

I'd prefer to get the output as BEAM_ROW, because providing an explicit schema also takes a lot of time with ~300 columns. However, if you have any idea to get better performance, it is welcome. Please check the Dataflow images below.
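For context, the read in my pipeline is set up roughly like this (a minimal sketch; the project/dataset/table names are placeholders, not my real ones):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions()  # runner/project/region flags come from the command line

with beam.Pipeline(options=options) as p:
    rows = (
        p
        # DIRECT_READ uses the BigQuery Storage Read API instead of an
        # export job; output_type='BEAM_ROW' lets Beam infer the schema,
        # so the ~300 columns do not have to be declared by hand.
        | "ReadFromBQ" >> beam.io.ReadFromBigQuery(
            table="my-project:my_dataset.my_table",  # placeholder
            method=beam.io.ReadFromBigQuery.Method.DIRECT_READ,
            output_type="BEAM_ROW",
        )
    )
```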

example dataflow

example dataflow2


Upvotes: 0

Views: 33

Answers (0)