Reputation: 13
I just started Cascading programming and have a Cascading job that needs to run a variable number of iterations. During each iteration, it reads from a file (Tap) generated by the previous iteration and writes calculated data to two separate SinkTaps.
I am using SinkMode.UPDATE
for "Tap final" to make this happen. It works correctly in local mode but fails in cluster mode, complaining that the file ("Tap final") already exists.
I am running CDH 4.4 and Cascading 2.5.2. It seems no one else has experienced this problem. If anyone knows a possible way to fix it, please let me know. Thanks.
Caused by: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://dv-db.machines:8020/tmp/xxxx/cluster/97916 already exists
at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:126)
at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:419)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:332)
at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1269)
at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1266)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1266)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:606)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:601)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:601)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:586)
at cascading.flow.hadoop.planner.HadoopFlowStepJob.internalNonBlockingStart(HadoopFlowStepJob.java:105)
at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:196)
at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:149)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:124)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:43)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
Upvotes: 1
Views: 179
Reputation: 269
SinkMode.UPDATE only works with sinks (such as databases) that support in-place updates.
If you're using Hfs (a filesystem tap), you'll need to use SinkMode.REPLACE.
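For example, a minimal sketch (the scheme, fields, and path here are placeholders):

import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

// REPLACE deletes the target directory (if it exists) before the flow runs,
// so Hadoop's output-spec check no longer throws FileAlreadyExistsException.
Tap finalSink = new Hfs(new TextDelimited(new Fields("key", "value"), "\t"),
        "/tmp/output/final", SinkMode.REPLACE);

Keep in mind that REPLACE discards whatever the previous iteration wrote to that path, so if you need to accumulate results across iterations, write each iteration to its own directory and merge at the end.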
Upvotes: 0
Reputation: 220
It would be helpful for understanding the issue if you could add your Cascading flow code to the question.
It seems a job file with the same name is being reused between different jobs in cluster mode. One simple solution, if you are fine with not running the steps concurrently, is to set the maximum number of concurrent steps to 1 before connecting the flow:
Properties jobProperties = new Properties();
FlowProps.setMaxConcurrentSteps(jobProperties, 1); // cascading.flow.FlowProps
FlowConnector flowConnector = new HadoopFlowConnector(jobProperties);
Flow flow = flowConnector.connect("name", sources, sinks, outPipe1, outPipe2);
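Note the ordering: in Cascading 2.x the property is set via FlowProps on the Properties object handed to the HadoopFlowConnector before connect(), so that it ends up in the flow's configuration at planning time.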
Upvotes: 0