Mel-BR
Mel-BR

Reputation: 23

Flink : cannot cancel a running job (streaming)

I want to run a streaming job.
When I try to run it locally using start-clusted.sh and the Flink Web Interface, I have no problem.

However, I am currently trying to run my job using Flink on YARN (deployed on Google Dataproc) and when I try to cancel it, the canceling state lasts forever and a slot remains occupied in the TaskManager.

Here is the log I got :

2016-10-18 16:56:04,053 INFO org.apache.flink.runtime.taskmanager.Task - 
Attempting to cancel task Source: pubSubMessageAcknowledgingSource -> 
TrackingDisplayPushDeduplicater -> TrackingDisplayPushDeserializer -> 
(Sink: TrackingDisplayPushErrorFlumeSink, Map -> Sink: 
TrackingDisplayPushValidFlumeSink) (1/1)
2016-10-18 16:56:04,053 INFO org.apache.flink.runtime.taskmanager.Task - 
Source: pubSubMessageAcknowledgingSource -> 
TrackingDisplayPushDeduplicater -> TrackingDisplayPushDeserializer -> 
(Sink: TrackingDisplayPushErrorFlumeSink, Map -> Sink: 
TrackingDisplayPushValidFlumeSink) (1/1) switched to CANCELING
2016-10-18 16:56:04,053 INFO org.apache.flink.runtime.taskmanager.Task - 
Triggering cancellation of task code Source: 
pubSubMessageAcknowledgingSource -> TrackingDisplayPushDeduplicater -> 
TrackingDisplayPushDeserializer -> (Sink: 
TrackingDisplayPushErrorFlumeSink, Map -> Sink: 
TrackingDisplayPushValidFlumeSink) (1/1) (38bf32d9199a0c9383a8b1e8d73a1f65).
2016-10-18 16:56:34,055 WARN org.apache.flink.runtime.taskmanager.Task - 
Task 'Source: pubSubMessageAcknowledgingSource -> 
TrackingDisplayPushDeduplicater -> TrackingDisplayPushDeserializer -> 
(Sink: TrackingDisplayPushErrorFlumeSink, Map -> Sink: 
TrackingDisplayPushValidFlumeSink) (1/1)' did not react to cancelling 
signal, but is stuck in method:
java.net.PlainSocketImpl.socketConnect(Native Method)
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
java.net.Socket.connect(Socket.java:589)
java.net.Socket.connect(Socket.java:538)
sun.net.NetworkClient.doConnect(NetworkClient.java:180)
sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
sun.net.www.http.HttpClient.New(HttpClient.java:308)
sun.net.www.http.HttpClient.New(HttpClient.java:326)
sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
sun.net.www.protocol.http.HttpURLConnection.getOutputStream0(HttpURLConnection.java:1283)
sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1258)
com.accengage.bigdata.flink.streaming.sinks.FlumeSink.flush(FlumeSink.java:107)
com.accengage.bigdata.flink.streaming.sinks.FlumeSink.invoke(FlumeSink.java:80)
com.accengage.bigdata.flink.streaming.sinks.FlumeSink.invoke(FlumeSink.java:25)l
org.apache.flink.streaming.api.operators.StreamSink.processElement(StreamSink.java:39)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:373)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:358)
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:346)
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:329)
org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:39)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:373)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:358)
org.apache.flink.streaming.api.collector.selector.DirectedOutput.collect(DirectedOutput.java:126)
org.apache.flink.streaming.api.collector.selector.DirectedOutput.collect(DirectedOutput.java:35)
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:346)
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:329)
org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:39)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:373)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:358)
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:346)
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:329)
org.apache.flink.streaming.api.operators.StreamFilter.processElement(StreamFilter.java:38)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:373)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:358)
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:346)
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:329)
org.apache.flink.streaming.api.operators.StreamSource$NonTimestampContext.collect(StreamSource.java:160)
com.accengage.bigdata.flink.streaming.sources.PubSubAcknowledgingSource.run(PubSubAcknowledgingSource.java:148)
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:80)
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:53)
org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:56)
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:266)
org.apache.flink.runtime.taskmanager.Task.run(Task.java:584)
java.lang.Thread.run(Thread.java:745)

Any idea of what I am doing wrong?
What could I do?

Thanks.

Upvotes: 2

Views: 3091

Answers (1)

Robert Metzger
Robert Metzger

Reputation: 4542

I assume you are using a custom Sink (com.accengage.bigdata.flink.streaming.sinks.FlumeSink) which uses some HTTP library for communicating with Flume.

Most likely, the HTTP library got struck in a loop or something when the interrupt was send to the thread (this happens for example when Interrupted exceptions are ignored)

To resolve the issue, you can either use a HTTP library which handles interrupts properly or call the library from a different thread, which will not receive the interrupts on the main thread.

In Flink 1.2 there will be some additional mechanism to avoid the system to get struck in the cancel() call. See FLINK-4715.

Upvotes: 3

Related Questions