Reputation: 7
Situation:
I read 25M+ rows (~300 columns) from BigQuery and write them to SQL Server via JDBC, and it takes too much time. When I dig in to figure out which step is the most time-inefficient, I come across SDFBoundedSourceReader. The SDFBoundedSourceReader step gets elements one by one, which increases the pipeline's elapsed time, requires a lot of vCPUs, and produces many errors like:
Error message from worker: Error encountered with the status channel: SDK harness sdk-0-0 disconnected.
Operation ongoing in bundle process_bundle-3484514982132990920-35 for at least 12m39s without outputting or completing:
Completed work item 7528337809130607698 UNSUCCESSFULLY: CANCELLED: [type.googleapis.com/util.MessageSetPayload='[dist_proc.dax.internal.StackTraceProto] { stack_top_loc { filepath: .... [dist_proc.dax.workflow.workflow_utils_message_ext]: WORK_PROGRESS_UPDATE_LEASE_ALREADY_CANCELLED }']
I tried:
- pre_optimize=all
- ReadFromBigQuery configurations (see the sketch below):
  - method: EXPORT and DIRECT_READ
  - output_type: BEAM_ROW and PYTHON_DICT
  - passing a schema I created (a NamedTuple)

I'd prefer getting the output as BEAM_ROW, because supplying a schema also takes a lot of time with ~300 columns. However, any ideas to get better performance are welcome. Please check the Dataflow screenshots below.
Upvotes: 0
Views: 33