Dolan Antenucci

Reputation: 15942

Why would an AWS EMR job with c3.8xlarge servers have serious lag vs the same job with cc2.8xlarge servers?

I suspect this may be a matter of something internal on AWS's end, but figured I'd post here since I don't have Premium AWS Support currently (update: I've signed up for AWS Support, so hopefully I can get an answer from them).

I have a recurring EMR job that I recently switched from using cc2.8xlarge servers to c3.8xlarge servers. On my first run with the new configuration, one of my map-reduce jobs that normally takes 2-3 minutes was stuck for over 9 hours copying the data from the mappers to the sole reducer. I killed the job after 9.5 hours, retried it on a new EMR cluster, saw the same behavior within the first hour, and killed it again. When I switched my job back to using cc2.8xlarge servers, the job finished in 2-3 minutes.

I checked AWS's Health Dashboard, but no errors were showing up. The c3.8xlarge servers should be equal or faster on all counts compared to the cc2.8xlarge (more CPU, SSD storage, etc.). It looks like all the clusters were in us-east-1a.

Has anyone run into similar issues? Any ideas on how to debug further?

Upvotes: 1

Views: 660

Answers (2)

Dolan Antenucci

Reputation: 15942

I mentioned two issues above (the second in my comment). Here's the solution to my first issue (reducers stuck during copy-phase):

A bit late of an update, but I did hear back from AWS Support. The problem was related to a bug they fixed in a newer AMI version than the one I was using.

A word of warning: I was using Boto with AMI = 'latest', but this was not actually giving me the latest version. Instead of using AMI v3.3.0 (latest as of Oct 2015), it was using AMI v2.4.2.
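For anyone hitting the same thing, a minimal sketch of pinning the AMI version explicitly with boto 2.x might look like the following (the step definition and S3 paths are placeholders, not my actual job):

    # Minimal sketch (boto 2.x EMR API): pin the AMI version explicitly
    # instead of relying on 'latest', which resolved to 2.4.2 for me.
    import boto.emr
    from boto.emr.step import StreamingStep

    conn = boto.emr.connect_to_region('us-east-1')

    step = StreamingStep(
        name='Example step',
        mapper='s3://mybucket/scripts/mapper.py',    # hypothetical paths
        reducer='s3://mybucket/scripts/reducer.py',
        input='s3://mybucket/input/',
        output='s3://mybucket/output/',
    )

    jobflow_id = conn.run_jobflow(
        name='My EMR job',
        ami_version='3.3.0',          # pinned explicitly, not 'latest'
        master_instance_type='c3.8xlarge',
        slave_instance_type='c3.8xlarge',
        num_instances=5,
        log_uri='s3://mybucket/logs/',
        steps=[step],
    )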

Here's the full reply from AWS Support describing the bug and fix:

I'm sorry for the delay. I was able to take a look at all 3 clusters that you provided. I can see repeated “Too many fetch-failures” in the step error log.

The reason is that multiple reducers were attempting to fetch map output from a single tasktracker, failing to retrieve the output, and eventually each map task attempt failed with a "Too many fetch failures" error. Hadoop recovers by rescheduling the map on another tasktracker. This may cause a long delay in processing if many maps were executed on the tasktracker having trouble providing output.

I can see that you have specified AMI version 2.4.2, which is affected by a known Jetty bug:

https://issues.apache.org/jira/browse/MAPREDUCE-2980

As we know, the occurrence of this issue is intermittent.

According to this link:

http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/ami-versions-supported.html

AMI versions later than 2.4.5 include fixes for this bug.

I suggest upgrading to our latest AMI version, 2.4.8, for your future jobs, which should take care of this issue.

Here's the solution to my second issue (s3-dist-cp job failing with m1.xlarge servers):

The problem was I simply didn't have enough DFS space for my s3-dist-cp job to complete. I switched to a server type with more storage space, and the job completed without issue.

Here's the full response from AWS support:

In reviewing the failed clusters, I can see the following repeated in the failed reduce task attempts:

2014-10-20 13:00:21,489 WARN org.apache.hadoop.hdfs.DFSClient (DataStreamer for file /tmp/51d16df5-4acd-42ca-86c8-ba21960b7567/tempspace/45e2055a-f3f7-40a1-b36f-91deabb5db511ca8b8e3-a3a2-417f-b3da-ff17b4bd4af8): DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /tmp/51d16df5-4acd-42ca-86c8-ba21960b7567/tempspace/45e2055a-f3f7-40a1-b36f-91deabb5db511ca8b8e3-a3a2-417f-b3da-ff17b4bd4af8 could only be replicated to 0 nodes, instead of 1 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1569) ...

And checking the instance state log on the master node, I also found the HDFS usage on each of the data nodes is pretty high. DFS Used% on 5 of the 7 data nodes is above 90%.

If you take a look at this document:

http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html

"During a copy operation, S3DistCp stages a temporary copy of the output in HDFS on the cluster. There must be sufficient free space in HDFS to stage the data, otherwise the copy operation fails. In addition, if S3DistCp fails, it does not clean the temporary HDFS directory, therefore you must manually purge the temporary files. For example, if you copy 500 GB of data from HDFS to S3, S3DistCp copies the entire 500 GB into a temporary directory in HDFS, then uploads the data to Amazon S3 from the temporary directory. When the copy is complete, S3DistCp removes the files from the temporary directory. If you only have 250 GB of space remaining in HDFS prior to the copy, the copy operation fails."

So, the HDFS filesystem running out of space seems to be the cause of this issue. To make sure S3DistCp can work successfully, please make sure there is at least 50% of HDFS space left. Otherwise, if the copy is unsuccessful, the temporary files will also take up HDFS space and will not be cleaned up automatically.
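Based on that guidance, one thing I could have done is sanity-check free HDFS space before kicking off the s3-dist-cp step. A rough sketch (assuming it runs on the master node where the `hadoop` CLI is available; `copy_size_bytes` is just an illustrative figure):

    # Rough sketch: check free HDFS space before launching an s3-dist-cp step,
    # since the copy stages the full dataset in HDFS first.
    import subprocess

    def hdfs_free_bytes():
        # `hadoop fs -df /` prints a header line, then: Filesystem Size Used Available Use%
        out = subprocess.check_output(['hadoop', 'fs', '-df', '/']).decode()
        fields = out.splitlines()[1].split()
        return int(fields[3])          # "Available" column, in bytes

    copy_size_bytes = 500 * 1024**3    # e.g. 500 GB to be copied (hypothetical)
    if hdfs_free_bytes() < copy_size_bytes:
        raise SystemExit("Not enough HDFS space to stage the s3-dist-cp copy")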

Upvotes: 1

Memos

Reputation: 464

There are 2 differences between c3.8xlarge and cc2.8xlarge that might cause issues:

  1. c3.8xlarge machines have much less disk space (2.8 TB less). However, I don't believe this is your problem.
  2. c3.8xlarge instances have less memory allocated for MapReduce tasks (in the default configuration).

Check here for verification if you use Hadoop 2.0, or here if you use Hadoop 1.0.

If you use Hadoop 1.0, as you can see in the link provided, the number of mappers and reducers is much higher (by default) for c3.8xlarge instances. This means that less memory is allocated for each map and reduce task (since both instance types have more or less the same total memory).

The way you describe the problem, it sounds like your job runs out of memory and therefore starts using the disk instead. This can be explained by the second difference listed above.
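If that is the case, one option is to raise the per-task memory yourself rather than relying on the instance-type defaults. A rough sketch with boto 2.x and the classic configure-hadoop bootstrap action (the property name and heap size below are illustrative; the right ones depend on your AMI/Hadoop version):

    # Rough sketch (boto 2.x, Hadoop 1.x-era AMI): raise the per-task JVM heap
    # via the configure-hadoop bootstrap action. Value is illustrative only.
    from boto.emr.bootstrap_action import BootstrapAction

    more_task_memory = BootstrapAction(
        name='Raise mapred task memory',
        path='s3://elasticmapreduce/bootstrap-actions/configure-hadoop',
        bootstrap_action_args=['-m', 'mapred.child.java.opts=-Xmx2048m'],
    )

    # Then pass bootstrap_actions=[more_task_memory] to conn.run_jobflow(...).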

@Dolan Antenucci: Now, as for the m1.xlarge vs m3.xlarge issue, we also faced the same problem in some of our I/O-bound EMR jobs. We concluded that the reason behind this is that m3.xlarge instances have much less disk space than their m1.xlarge counterparts (1.6 TB less). So in our case the error we were getting was an "Out of space" error of some sort. It could be useful for you to check whether you get the same type of error as well.

Upvotes: 1
