midnight1247

Reputation: 366

Hadoop writes incomplete file to HDFS

I have a logging system that writes 5-minute aggregates to Hadoop, over ~60GB of data each hour. My problem is that, when reviewing the data, some peak hours seem incomplete (only part of the logs are written). This loss of information coincides with the periods of highest load. I am following the logs and reading every trace I can, but I can't figure out exactly what the problem is.

The batch process gets log lines from many servers via Kafka and then queues those lines to writer threads, which open files on HDFS and write all the lines into them (one file per source server; see the sketch after the log excerpt below). With low to medium load everything is OK. Under high load, the logs start to spit out errors and warnings:

    2016-10-08 14:16:24 INFO  Exception in createBlockOutputStream
    java.io.IOException: Got error, status message , ack with firstBadLink as <dnip>:50010
            at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:140)
            at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1397)
            at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1299)
            at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:464)
    2016-10-08 14:16:24 INFO  Abandoning BP-1891784736-10.118.249.120-1467026091136:blk_1077950482_131390250
    2016-10-08 14:16:24 INFO  Excluding datanode DatanodeInfoWithStorage[<dnip>:50010,DS-d1b79bfe-3ee8-4dba-b3a3-48e50a357b30,DISK]
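
For context, each writer thread does roughly the following. This is a simplified, hypothetical sketch (class and queue names are made up, compression and error handling are trimmed), not the actual code:

    import java.nio.charset.StandardCharsets;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.TimeUnit;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Hypothetical writer thread: drains the lines queued by the Kafka
    // consumers and writes them to one HDFS file per source server.
    public class HdfsLogWriter implements Runnable {

        private final FileSystem fs;
        private final BlockingQueue<String> lines; // lines for a single source server
        private final Path target;                 // one output file per source server

        public HdfsLogWriter(Configuration conf, BlockingQueue<String> lines, Path target)
                throws Exception {
            this.fs = FileSystem.get(conf);
            this.lines = lines;
            this.target = target;
        }

        @Override
        public void run() {
            // Every createBlockOutputStream / DataStreamer error in the logs above
            // is raised inside these create()/write()/close() calls.
            try (FSDataOutputStream out = fs.create(target, true)) {
                String line;
                // Write until the queue has been idle for one second,
                // then flush and close the file.
                while ((line = lines.poll(1, TimeUnit.SECONDS)) != null) {
                    out.write((line + "\n").getBytes(StandardCharsets.UTF_8));
                }
                out.hflush(); // push buffered bytes into the datanode pipeline before close()
            } catch (Exception e) {
                // In the real process this is where the IOExceptions above surface.
                throw new RuntimeException(e);
            }
        }
    }

The relevant point is that each open output stream has its own block pipeline to the datanodes; if close() fails (as with the "could only be replicated to 0 nodes" error below), the last block is never completed, which seems consistent with the zero-size and partially written files I'm seeing.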

A few seconds after those first errors, new ones appear:

    2016-10-08 14:17:34 INFO  Exception in createBlockOutputStream
    java.net.SocketTimeoutException: 65000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel

After these errors, the batch process starts to close the files. Then, for some files only, new errors appear:

    2016-10-08 14:17:34 WARN  DataStreamer Exception
    org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /my/log/path/log.gz could only be replicated to 0 nodes instead of minReplication (=1).  There are 2 datanode(s) running and 2 node(s) are excluded in this operation.

Those specific files end up with a size of zero. The rest are only written partially, losing some data. The HDFS logs also state:

    2016-10-08 14:21:22,789 WARN  datanode.DataNode (BlockReceiver.java:receivePacket(694)) - Slow BlockReceiver write data to disk cost:2143ms (threshold=300ms)

and

    DataXceiver error processing WRITE_BLOCK operation

Looking at all the logs together, the HDFS WARN seems correlated with the loss of information, and also with the createBlockOutputStream exceptions. Whenever there are many lines with those errors, there is data loss.

Are there any other logs I should check? Maybe some Hadoop tuning?
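
In case it helps to narrow things down, these are the settings I was planning to review first. The values below are purely illustrative, based on my reading of the defaults, not recommendations:

    <!-- hdfs-site.xml (illustrative values only) -->

    <!-- Concurrent DataXceiver threads per datanode (default 4096); the
         "DataXceiver error processing WRITE_BLOCK" lines made me look here. -->
    <property>
      <name>dfs.datanode.max.transfer.threads</name>
      <value>8192</value>
    </property>

    <!-- Datanode IPC handler threads (default 10). -->
    <property>
      <name>dfs.datanode.handler.count</name>
      <value>20</value>
    </property>

    <!-- Client socket read timeout in ms (default 60000); if I read the
         defaults right, the 65000 ms in the SocketTimeoutException above is
         this value plus the per-datanode pipeline extension. -->
    <property>
      <name>dfs.client.socket-timeout</name>
      <value>120000</value>
    </property>

With only two datanodes, and both excluded during the failure, I suspect the datanode side rather than the client, but I can't confirm it yet.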

Upvotes: 0

Views: 4769

Answers (1)

midnight1247

Reputation: 366

As a partial answer: we found that, on the worker nodes, the GC was causing lots of long pauses (3~5 seconds) every six hours (the predefined GC interval). We increased the heap from 1 GB to 4 GB and that seems to have solved it. What makes the heap fill up constantly is still an open question, but it is beyond the scope of this one. After the heap increase, there are no more errors (related to this) in the logs.
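
For reference, the change was essentially the following in hadoop-env.sh on the worker nodes. This is a sketch under the assumption that the heap in question is the DataNode's; adjust it if a different JVM is the one pausing. The GC logging flags are only there to make the long pauses visible:

    # hadoop-env.sh on the worker nodes: raise the heap from the ~1 GB default
    # to 4 GB and log GC activity so long pauses show up in the GC log.
    export HADOOP_DATANODE_OPTS="-Xms4g -Xmx4g \
      -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
      $HADOOP_DATANODE_OPTS"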

Upvotes: 2
