stan

Reputation: 105

Hadoop DistributedCache failed to report status

In a Hadoop job I am mapping several XML files and filtering an ID for every element (from <id> tags). Since I want to restrict the job to a certain set of IDs, I read in a large file (about 250 million lines, 2.7 GB, each line containing just an integer ID). So I use a DistributedCache, parse the file in the setup() method of the Mapper with a BufferedReader, and save the IDs to a HashSet.
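
For illustration, the setup() logic described above looks roughly like this (a sketch only; the class name, field names and input/output types are placeholders, not the original code):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class IdFilterMapper extends Mapper<LongWritable, Text, Text, Text> {

        // One entry per line of the 2.7 GB ID file -- this is the set I suspect is the problem.
        private final Set<Long> allowedIds = new HashSet<Long>();

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // The ID file was added to the DistributedCache by the driver.
            Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            BufferedReader reader = new BufferedReader(new FileReader(cacheFiles[0].toString()));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    allowedIds.add(Long.parseLong(line.trim()));
                }
            } finally {
                reader.close();
            }
        }

        // map() then emits an element only if its <id> is contained in allowedIds.
    }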

Now when I start the job, I get countless errors of the form

Task attempt_201201112322_0110_m_000000_1 failed to report status. Killing!

before any map task is executed.

The cluster consists of 40 nodes, and since the files in the DistributedCache are copied to the slave nodes before any tasks for the job are executed, I assume the failure is caused by the large HashSet. I have already increased mapred.task.timeout to 2000 seconds. Of course I could raise it even more, but this period really should suffice, shouldn't it?
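
For completeness, the driver-side setup is roughly the following (a sketch; the path, job name and class names are placeholders, not my real job configuration):

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.mapreduce.Job;

    public class IdFilterDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // mapred.task.timeout is specified in milliseconds: 2,000,000 ms = 2000 s.
            conf.setLong("mapred.task.timeout", 2000000L);
            // Ship the ID file to every slave node before any task of the job runs.
            DistributedCache.addCacheFile(new URI("/user/stan/ids.txt"), conf);

            Job job = new Job(conf, "xml-id-filter");
            job.setJarByClass(IdFilterDriver.class);
            job.setMapperClass(IdFilterMapper.class);
            // ... input/output paths and formats omitted ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }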

Since the DistributedCache is meant to be a way to "distribute large, read-only files efficiently", I wondered what causes the failure here, and whether there is another way to pass the relevant IDs to every map task?

Upvotes: 2

Views: 299

Answers (1)

Chris White

Reputation: 30089

Can you add some debug printlns to your setup method to check that it is timing out there (log the entry and exit times)?
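
Something along these lines, dropped into your existing mapper (just a sketch of the timing idea; the log wording is illustrative):

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        long start = System.currentTimeMillis();
        System.err.println("setup() entered at " + new java.util.Date(start));

        // ... existing DistributedCache / HashSet loading code ...

        long elapsed = System.currentTimeMillis() - start;
        System.err.println("setup() finished after " + elapsed + " ms");
    }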

You may also want to look into using a BloomFilter to hold the IDs. You can probably store these values in a 50 MB bloom filter with a good false positive rate (~0.5%), and then run a secondary job to perform a partitioned check against the actual reference file.
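
A rough sketch of that idea using Hadoop's built-in bloom filter classes (org.apache.hadoop.util.bloom); the vector size, hash count and sample ID below are illustrative only and need to be sized for your ID count and target false-positive rate:

    import org.apache.hadoop.util.bloom.BloomFilter;
    import org.apache.hadoop.util.bloom.Key;
    import org.apache.hadoop.util.hash.Hash;

    public class IdBloomFilterSketch {
        public static void main(String[] args) {
            // 400,000,000 bits is about 50 MB; 7 hash functions is a placeholder choice.
            BloomFilter filter = new BloomFilter(400000000, 7, Hash.MURMUR_HASH);

            // Build the filter once from the reference file, then serialize it
            // (BloomFilter implements Writable) and ship it via the DistributedCache.
            filter.add(new Key("12345".getBytes()));

            // In the mapper: a negative test is definite, a positive one may be a
            // false positive, hence the secondary check against the reference file.
            boolean maybeAllowed = filter.membershipTest(new Key("12345".getBytes()));
            System.out.println("maybe allowed: " + maybeAllowed);
        }
    }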

Upvotes: 0
