AGS

Reputation: 427

EC2 to S3 uploads fail randomly

I have a few hundred AWS Batch jobs that run on AWS EC2 on-demand instances. Each job carries out some computation to generate two Parquet files, A and B, and uploads them to separate paths in the same bucket.

When I run these AWS Batch jobs, about 60-70% of them fail. Inspecting the logs, some of the failed jobs upload file A successfully while file B fails with the following exception:

botocore.exceptions.ConnectionClosedError: Connection was closed before we received a valid response from endpoint URL:

Sometimes both files A and B fail to upload with the same exception.

Both Parquet files are about 200 MB each.

The other 30-40% of jobs succeed without hitting this network issue.

What could be the cause of this intermittent failure? How would one go about debugging this?

EDIT - I'll mark this closed. For anyone else running into this issue: the cause was a self-hosted NAT instance that was throttling the bandwidth. I had set up too small an instance (fck-nat) that couldn't handle the hundred-odd jobs running at the same time.
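Until the NAT is resized, one mitigation is to cap each job's upload bandwidth with boto3's TransferConfig so that many concurrent jobs can't saturate the NAT. A minimal sketch (the bucket and key names are placeholders, and 10 MB/s is purely illustrative); a VPC gateway endpoint for S3 would also keep this traffic off the NAT entirely:

import boto3
from boto3.s3.transfer import TransferConfig

# Cap this job's upload throughput so many concurrent jobs
# don't saturate the NAT instance.
config = TransferConfig(max_bandwidth=10 * 1024 * 1024)  # bytes/sec

s3 = boto3.client("s3")
s3.upload_file("a.parquet", "my-bucket", "outputs/a.parquet", Config=config)
s3.upload_file("b.parquet", "my-bucket", "outputs/b.parquet", Config=config)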

Upvotes: -3

Views: 55

Answers (1)

smoot

Reputation: 322

Will need some code snippets to dig further... In the meantime, some questions to ask:

  • Are they all in the same VPC?
  • Is there any difference between scripts?
  • Is there any data skew (where some files are much larger than others)?
    • If you're doing some processing on input files, and the inputs are all 200 MB but the transformations create new data, those transforms might create skew in the final output, though that's hard to say without more detail
  • Are you sure they're all on-demand and not being dropped as spot instances? (One way to check from inside a job is shown after this list.)
  • Lastly, are you using long-lived connections throughout? That is, something like the following:
import boto3

s3 = boto3.client("s3")            # client (and its connection pool) created up front
process_data_for_a_while()         # long-running computation in between
s3.upload_file(path, bucket, key)  # upload only happens much later

If you do, then maybe the connection is too long-lived and has gone stale by the time the upload starts.
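If that's the pattern, a quick experiment is to create the client right before each upload, with retries enabled, and see whether the failure rate changes. A rough sketch (bucket/key names are placeholders, and the retry settings are just a starting point):

import boto3
from botocore.config import Config

def upload_fresh(path, bucket, key):
    # Fresh client per upload, with adaptive retries, so a connection
    # that went stale during the long compute phase can't be reused.
    s3 = boto3.client(
        "s3",
        config=Config(retries={"max_attempts": 10, "mode": "adaptive"}),
    )
    s3.upload_file(path, bucket, key)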
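And for the spot-instance question above, one way to verify from inside a job is to read the instance metadata (this sketch assumes IMDSv1 is reachable; IMDSv2-only setups need to fetch a session token first):

import urllib.request

# Prints "spot" for spot instances, "on-demand" otherwise.
url = "http://169.254.169.254/latest/meta-data/instance-life-cycle"
with urllib.request.urlopen(url, timeout=2) as resp:
    print(resp.read().decode())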

Upvotes: -1
