AGS

Reputation: 427

EC2 to S3 uploads fail randomly

I have a few hundred AWS Batch jobs that run on AWS EC2 on-demand instances. Each job carries out some computation to generate two Parquet files, A and B, and uploads them to separate paths in the same bucket.

When I run these AWS Batch jobs, about 60-70% of them fail. Inspecting the logs, some of the failed jobs upload file A successfully while file B fails with the following exception:

botocore.exceptions.ConnectionClosedError: Connection was closed before we received a valid response from endpoint URL:

Sometimes both files A and B fail to upload with the same exception.

Both Parquet files are about 200 MB each.

The other 30-40% of jobs succeed without hitting this network issue.

What could be the cause of this intermittent failure? How would one go about debugging this?

EDIT - I'll mark this closed. For anyone else running into this issue: the cause was a self-hosted NAT instance that was throttling the bandwidth. I had set up too small an instance (fck-nat) that couldn't handle the hundred-odd jobs running at the same time.
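Until the NAT is resized, one mitigation is to cap each job's upload bandwidth with boto3's TransferConfig so that many concurrent jobs can't saturate the NAT. A minimal sketch (the bucket and key names are placeholders, and 10 MB/s is purely illustrative); a VPC gateway endpoint for S3 would also keep this traffic off the NAT entirely:

import boto3
from boto3.s3.transfer import TransferConfig

# Cap this job's upload throughput so many concurrent jobs
# don't saturate the NAT instance.
config = TransferConfig(max_bandwidth=10 * 1024 * 1024)  # bytes/sec

s3 = boto3.client("s3")
s3.upload_file("a.parquet", "my-bucket", "outputs/a.parquet", Config=config)
s3.upload_file("b.parquet", "my-bucket", "outputs/b.parquet", Config=config)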

Upvotes: -3

Views: 55

Answers (1)

smoot

Reputation: 322

Will need some code snippets to dig further... In the meantime, some questions to ask:

  • Are they all in the same VPC?
  • Is there any difference between scripts?
  • Is there any data skew (where some files are much larger than others)?
    • If you're doing some processing on input files, and the inputs are all 200 MB but the transformations create new data, those transforms might create skew in the final output, though that's hard to say without more detail
  • Are you sure they're all on-demand and not being dropped as spot instances? (One way to check from inside a job is shown after this list.)
  • Lastly, are you using long-lived connections throughout? That is, something like the following:
import boto3

s3 = boto3.client("s3")            # client (and its connection pool) created up front
process_data_for_a_while()         # long-running computation in between
s3.upload_file(path, bucket, key)  # upload only happens much later

If you do, then maybe the connection is too long-lived and has gone stale by the time the upload starts.
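If that's the pattern, a quick experiment is to create the client right before each upload, with retries enabled, and see whether the failure rate changes. A rough sketch (bucket/key names are placeholders, and the retry settings are just a starting point):

import boto3
from botocore.config import Config

def upload_fresh(path, bucket, key):
    # Fresh client per upload, with adaptive retries, so a connection
    # that went stale during the long compute phase can't be reused.
    s3 = boto3.client(
        "s3",
        config=Config(retries={"max_attempts": 10, "mode": "adaptive"}),
    )
    s3.upload_file(path, bucket, key)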
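And for the spot-instance question above, one way to verify from inside a job is to read the instance metadata (this sketch assumes IMDSv1 is reachable; IMDSv2-only setups need to fetch a session token first):

import urllib.request

# Prints "spot" for spot instances, "on-demand" otherwise.
url = "http://169.254.169.254/latest/meta-data/instance-life-cycle"
with urllib.request.urlopen(url, timeout=2) as resp:
    print(resp.read().decode())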

Upvotes: -1
