One of our Spark applications frequently runs into a Kerberos authentication error on a Hadoop cluster. Initially we believed it was caused by a misconfigured delegation token renewal policy, but later we found the following messages in the Spark driver log:
22/01/15 02:13:38 INFO YARNHadoopDelegationTokenManager: Attempting to login to KDC using principal: XXX/[email protected]
22/01/15 02:13:38 INFO YARNHadoopDelegationTokenManager: Successfully logged into KDC.
22/01/15 02:13:38 INFO HadoopFSDelegationTokenProvider: getting token for: DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-206445389_46, ugi=XXX/[email protected] (auth:KERBEROS)]] with renewer yarn/[email protected]
22/01/15 02:13:38 INFO YARNHadoopDelegationTokenManager: Scheduling renewal in 3.7 min.
22/01/15 02:13:38 INFO YARNHadoopDelegationTokenManager: Updating delegation tokens.
22/01/15 02:13:38 INFO SparkHadoopUtil: Updating delegation tokens for current user.
22/01/15 02:17:23 INFO YARNHadoopDelegationTokenManager: Attempting to login to KDC using principal: XXX/[email protected]
22/01/15 02:17:23 INFO YARNHadoopDelegationTokenManager: Successfully logged into KDC.
22/01/15 02:17:23 INFO HadoopFSDelegationTokenProvider: getting token for: DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1775743108_46, ugi=XXX/[email protected] (auth:KERBEROS)]] with renewer yarn/[email protected]
22/01/15 02:17:23 INFO YARNHadoopDelegationTokenManager: Scheduling renewal in 3.7 min.
22/01/15 02:17:23 INFO YARNHadoopDelegationTokenManager: Updating delegation tokens.
22/01/15 02:17:23 INFO SparkHadoopUtil: Updating delegation tokens for current user.
22/01/15 02:17:28 ERROR DriverLogger$DfsAsyncWriter: Failed writing driver logs to dfs
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (token for XXX: HDFS_DELEGATION_TOKEN owner=XXX/[email protected], renewer=yarn, realUser=, issueDate=1642212592939, maxDate=1642212892939, sequenceNumber=25145, masterKeyId=237) is expired, current time: 2022-01-15 02:17:28,048+0000 expected renewal time: 2022-01-15 02:14:52,939+0000
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1508)
at org.apache.hadoop.ipc.Client.call(Client.java:1454)
at org.apache.hadoop.ipc.Client.call(Client.java:1364)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy12.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:497)
at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
at com.sun.proxy.$Proxy13.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.DFSOutputStream.addBlock(DFSOutputStream.java:1085)
at org.apache.hadoop.hdfs.DataStreamer.locateFollowingBlock(DataStreamer.java:1865)
at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1668)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:716)
22/01/15 02:17:33 ERROR DriverLogger$DfsAsyncWriter: Failed writing driver logs to dfs
So the DriverLogger error was triggered only 5 seconds after a successful renewal (at which point the latest token couldn't possibly have expired), for the same Hadoop user. The only remaining possibility seems to be that the DriverLogger tried to write to HDFS using an obsolete delegation token.
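Converting the issueDate/maxDate millisecond timestamps from the error message supports this: the failing token was issued at 02:09:52, i.e. before both successful renewals (02:13:38 and 02:17:23), and its 5-minute lifetime ended at exactly the "expected renewal time" reported in the error. A small sketch of the conversion (pure arithmetic on the values from the log above):

```python
from datetime import datetime, timezone

# Values taken verbatim from the InvalidToken error message (epoch milliseconds)
issue_ms = 1642212592939  # issueDate of the failing token
max_ms = 1642212892939    # maxDate of the failing token

def fmt(ms):
    """Render an epoch-millisecond timestamp as a UTC wall-clock string."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

lifetime_s = (max_ms - issue_ms) / 1000

print(fmt(issue_ms))  # -> 2022-01-15 02:09:52  (before both renewals in the log)
print(fmt(max_ms))    # -> 2022-01-15 02:14:52  (the "expected renewal time" in the error)
print(lifetime_s)     # -> 300.0 seconds, i.e. a 5-minute token lifetime
```

So at 02:17:28 the writer was still holding a token from 02:09:52, two renewal cycles stale.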
How can I confirm this hypothesis, and how can I fix it?
UPDATE 1: The above log is from launching a Spark ThriftServer in a YARN environment (using ./sbin/start-thriftserver). Strangely, if I submit a normal application (e.g. the SparkPi example), the problem does not appear, even after prolonged execution.
So some part of the Spark ThriftServer may contain a bug that causes the token used by the DriverLogger to fall out of sync. I just don't know which part. The question then becomes: what are possible ways to ensure that the token being renewed and the token being used for the HDFS write are the same token?
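For anyone reproducing this: the DfsAsyncWriter in the trace only runs when driver-log persistence to DFS is turned on. Assuming the standard Spark 3.x configuration keys (check your version's configuration docs), disabling it suppresses the symptom, though it is only a blunt workaround and not a fix for the underlying token desync:

```properties
# Disable persisting driver logs to HDFS; DriverLogger$DfsAsyncWriter
# is only active when this is true.
spark.driver.log.persistToDfs.enabled=false

# If you keep it enabled, this is the (hypothetical example) HDFS
# directory the driver logs would be written to:
# spark.driver.log.dfsDir=hdfs:///var/log/spark/driver
```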