Reputation: 427
I have a few hundred AWS Batch jobs that run on AWS EC2 on-demand instances. Each job carries out some computation to generate two parquet files, A and B, and uploads A and B to separate paths in the same bucket.
When I run these AWS Batch jobs, about 60-70% of them fail. Inspecting the logs, some of these jobs upload file A successfully while file B fails with the following exception:
botocore.exceptions.ConnectionClosedError: Connection was closed before we received a valid response from endpoint URL:
Sometimes both files A and B fail to upload with the same exception.
Each parquet file is about 200 MB.
The other 30-40% of the jobs, which do succeed, do not experience this network issue.
What could be the cause of this intermittent failure? How would one go about debugging this?
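A minimal sketch of the kind of extra diagnostics one could enable in a job, assuming stock boto3/botocore and nothing specific to my setup, to see the request on which the connection dies:

    import logging
    import boto3

    # Sketch: route botocore's debug logging to stderr so the failing
    # upload request and the dropped connection show up in the job logs.
    boto3.set_stream_logger("botocore", logging.DEBUG)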
EDIT - I'll mark this closed. For anyone else running into this issue: it was due to a self-hosted NAT that was throttling bandwidth. I had set up too small an instance (fck-nat), which couldn't handle the ~100 jobs running at the same time.
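For what it's worth, a band-aid that might have masked the symptom while the NAT was undersized is a client with longer timeouts and adaptive retries. This is only a sketch assuming a plain boto3 client; the bucket and key names are placeholders, and the real fix was sizing the NAT instance correctly:

    import boto3
    from botocore.config import Config

    # Sketch: longer read timeout and adaptive retries so uploads can
    # survive a saturated NAT. Bucket/key names are placeholders.
    cfg = Config(
        connect_timeout=60,
        read_timeout=300,
        retries={"max_attempts": 10, "mode": "adaptive"},
    )
    s3 = boto3.client("s3", config=cfg)
    s3.upload_file("a.parquet", "my-bucket", "outputs/a/a.parquet")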
Upvotes: -3
Views: 55
Reputation: 322
Will need some code snippets to dig further... There are some similar answers, including cli-connection-timeout, which might help.
Question to ask: do you have a pattern like this, where the client is created long before the upload?

    import boto3

    cnxn = boto3.client("s3")            # client created up front
    process_data_for_a_while()           # long-running computation
    cnxn.upload_file(file, bucket, key)  # placeholders for the real arguments

If you do, then maybe the cnxn is too long-lived.
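If that is the shape of the code, one option is to create the client only once the files are ready to go. A sketch, with a hypothetical helper and placeholder bucket/key prefixes:

    import os
    import boto3

    def upload_results(path_a, path_b, bucket):
        # Sketch: create the client only after the computation has finished,
        # right before the uploads; bucket and key prefixes are placeholders.
        s3 = boto3.client("s3")
        s3.upload_file(path_a, bucket, "output/a/" + os.path.basename(path_a))
        s3.upload_file(path_b, bucket, "output/b/" + os.path.basename(path_b))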
Upvotes: -1