Reputation: 366
I have a logging system writing 5-minute aggregates to Hadoop, over 60 GB of data each hour. My problem is that, when reviewing the data, some peak hours seem incomplete (only part of the logs get written). This loss of information coincides with the periods of highest load. I'm following the logs and reading every trace I can, but I can't find out what exactly the problem is.
The batch process gets log lines from many servers via Kafka and queues those lines to writer threads, which open files on HDFS and write the lines out (one file per source server; a simplified sketch of this write path is at the end of the question). With low to medium load, everything is fine. Under high workloads, the logs start to show errors and warnings:
2016-10-08 14:16:24 INFO Exception in createBlockOutputStream
java.io.IOException: Got error, status message , ack with firstBadLink as <dnip>:50010
at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:140)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1397)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1299)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:464)
2016-10-08 14:16:24 INFO Abandoning BP-1891784736-10.118.249.120-1467026091136:blk_1077950482_131390250
2016-10-08 14:16:24 INFO Excluding datanode DatanodeInfoWithStorage[<dnip>:50010,DS-d1b79bfe-3ee8-4dba-b3a3-48e50a357b30,DISK]
A few seconds later, new errors appear:
2016-10-08 14:17:34 INFO Exception in createBlockOutputStream
java.net.SocketTimeoutException: 65000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel
After these errors, the batch process starts closing files. Then, for some of the files only, new errors appear:
2016-10-08 14:17:34 WARN DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /my/log/path/log.gz could only be replicated to 0 nodes instead of minReplication (=1). There are 2 datanode(s) running and 2 node(s) are excluded in this operation.
Those specific files end up with size zero. The other files are only partially written, losing some data. The HDFS logs also state:
2016-10-08 14:21:22,789 WARN datanode.DataNode (BlockReceiver.java:receivePacket(694)) - Slow BlockReceiver write data to disk cost:2143ms (threshold=300ms)
and
DataXceiver error processing WRITE_BLOCK operation
Looking at all the logs, the WARN in HDFS seems correlated with the loss of information, and also with the createBlockOutputStream exceptions. Whenever there are lots of lines with those errors, there is data loss.
Are there any other logs I should check? Maybe some Hadoop tuning?
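For reference, each writer thread does roughly the following. This is a simplified sketch, not the real code: the class and field names are made up for illustration, and only the org.apache.hadoop.fs classes are the actual HDFS client API.

    import java.io.BufferedWriter;
    import java.io.OutputStreamWriter;
    import java.nio.charset.StandardCharsets;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.TimeUnit;
    import java.util.zip.GZIPOutputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Illustrative writer thread: drains lines queued by the Kafka consumer and
    // writes them to one gzipped file per source server on HDFS.
    public class LogWriterTask implements Runnable {
        private final BlockingQueue<String> lines;  // filled by the Kafka consumer
        private final Path target;                  // e.g. /my/log/path/log.gz (one per server)
        private final Configuration conf = new Configuration();

        public LogWriterTask(BlockingQueue<String> lines, Path target) {
            this.lines = lines;
            this.target = target;
        }

        @Override
        public void run() {
            try {
                FileSystem fs = FileSystem.get(conf);
                // fs.create() is where the DataStreamer / createBlockOutputStream
                // machinery from the stack traces above gets involved.
                try (FSDataOutputStream out = fs.create(target);
                     BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(
                             new GZIPOutputStream(out), StandardCharsets.UTF_8))) {
                    String line;
                    // Stop once the queue has been idle for a while (simplification).
                    while ((line = lines.poll(30, TimeUnit.SECONDS)) != null) {
                        writer.write(line);
                        writer.newLine();
                    }
                }
            } catch (Exception e) {
                // In the real process these show up as the IOExceptions /
                // SocketTimeoutExceptions in the logs above.
                e.printStackTrace();
            }
        }
    }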
Upvotes: 0
Views: 4769
Reputation: 366
As a partial answer: we found that on the worker nodes the GC was causing lots of long pauses (3~5 seconds) every six hours (the predefined GC span). We increased the heap from 1 GB to 4 GB and the problem seems to be solved. Why the heap keeps filling up is still an open question, but it is beyond the scope of this one. Since the heap increase, there are no more errors (related to this) in the logs.
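For reference, the fix boils down to raising the JVM heap of the affected process (and enabling GC logging first helps to confirm the long pauses). Where exactly the flags go depends on which JVM you resize; the lines below are only an illustration for a Hadoop daemon configured through hadoop-env.sh, not our exact setup:

    # hadoop-env.sh -- illustrative only; adjust to whichever JVM shows the GC pauses
    # Raise the heap from 1 GB to 4 GB:
    export HADOOP_DATANODE_OPTS="-Xms4g -Xmx4g ${HADOOP_DATANODE_OPTS}"

    # Optional: GC logging to verify pause times before/after the change:
    export HADOOP_DATANODE_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
      -Xloggc:/var/log/hadoop/datanode-gc.log ${HADOOP_DATANODE_OPTS}"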
Upvotes: 2