Reputation: 13
I just started Cascading programming and have a Cascading job that needs to run a variable number of iterations. During each iteration, it reads from a file (Tap) generated by the previous iteration and writes calculated data to two separate SinkTaps.
I am using SinkMode.UPDATE
for "Tap final" to make this happen. It works correctly in local mode but fails in cluster mode, complaining that the file ("Tap final") already exists.
I am running CDH 4.4 and Cascading 2.5.2. It seems no one else has experienced this problem. If anyone knows a possible way to fix it, please let me know. Thanks.
Caused by: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://dv-db.machines:8020/tmp/xxxx/cluster/97916 already exists
at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:126)
at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:419)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:332)
at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1269)
at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1266)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1266)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:606)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:601)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:601)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:586)
at cascading.flow.hadoop.planner.HadoopFlowStepJob.internalNonBlockingStart(HadoopFlowStepJob.java:105)
at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:196)
at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:149)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:124)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:43)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
Upvotes: 1
Views: 179
Reputation: 269
SinkMode.UPDATE only works with sinks (such as databases) that support in-place updates.
If you're using Hfs (a filesystem tap), you'll need to use SinkMode.REPLACE.
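For example, a minimal sketch (the scheme, fields, and path here are placeholders):

import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

// REPLACE deletes the target directory (if it exists) before the flow runs,
// so Hadoop's output-spec check no longer throws FileAlreadyExistsException.
Tap finalSink = new Hfs(new TextDelimited(new Fields("key", "value"), "\t"),
        "/tmp/output/final", SinkMode.REPLACE);

Keep in mind that REPLACE discards whatever the previous iteration wrote to that path, so if you need to accumulate results across iterations, write each iteration to its own directory and merge at the end.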
Upvotes: 0
Reputation: 220
It would be helpful for understanding the issue if you could add your Cascading flow code to the question.
It seems a job file with the same name is being reused between different jobs in cluster mode. One simple solution, if you are fine with not running the steps concurrently, is to set the maximum number of concurrent steps to 1 before connecting the flow:
Properties jobProperties = new Properties();
FlowProps.setMaxConcurrentSteps(jobProperties, 1); // cascading.flow.FlowProps
FlowConnector flowConnector = new HadoopFlowConnector(jobProperties);
Flow flow = flowConnector.connect("name", sources, sinks, outPipe1, outPipe2);
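Note the ordering: in Cascading 2.x the property is set via FlowProps on the Properties object handed to the HadoopFlowConnector before connect(), so that it ends up in the flow's configuration at planning time.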
Upvotes: 0