gnomie
gnomie

Reputation: 439

Worried: Corrupt HDFS on single node - how to resolve

I am running CDH 5.6 (Hadoop 2.6, HBase 1.0.0) on a single machine. There is only Hadoop and HBase running. Hadoop is configured for replication factor 1, Hbase running on top of HDFS, no pseudo-distributed mode.

For the last three days, I had a simple program running that would insert rows into HBase using 10 threads in parallel.

Checking on it now I find that HDFS has gone corrupt and all but one inserting thread have failed.

Running

hdfs fsck / | grep CORRUPT

I see that there are some corrupted blocks.

hbase hbck

says everything if fine.

When restarting, all the sudden hdfs fsck says its HEALTHY again.

Starting the insertion gets me checksum errors again in the region server log (as below).

Finally I ran

hdfs fsck / -delete

and only after restarting everything, the insert works again.

There are some details below. The question is:

  1. How can HDFS possibly get corrupted even on a single node that is doing nothing but simply trying to insert data into Hbase? and

  2. How can I prevent that in the future?

  3. If hbase hbck said everything is OK, does that mean there was no data loss?

DETAILS:

I checked logs of region server, data node, namenode, hmaster, zookeeper. There was no out of memory or anything like that. All processes have been up all the time and kept responding until the very end. There is no disk space shortage.

Application log shows the following failure at around 05/16 08:31:32:


java.lang.RuntimeException: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=36, exceptions:
java.net.SocketTimeoutException: callTimeout=60000, callDuration=64720: row 'a899ca26-45a4-4af6-af34-72c035b4b7da/f6f88f36-9263-4dbb-9588-aaf126be9117/9223370573623181733' on table 'tt_items' at region=tt_items,a899ca26-45a4-4af6-af34-72c035b4b7da/b50993c1-7ff4-4169-b58f-f53878697709/9223370573954024736,1462900815686.08255086d13380bd559a87dd93cc15ba., hostname=hb-desktop,16201,1463231294049, seqNum=51380656
Caused by: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=36, exceptions:

java.net.SocketTimeoutException: callTimeout=60000, callDuration=64720: row 'a899ca26-45a4-4af6-af34-72c035b4b7da/f6f88f36-9263-4dbb-9588-aaf126be9117/9223370573623181733' on table 'tt_items' at region=tt_items,a899ca26-45a4-4af6-af34-72c035b4b7da/b50993c1-7ff4-4169-b58f-f53878697709/9223370573954024736,1462900815686.08255086d13380bd559a87dd93cc15ba., hostname=hb-desktop,16201,1463231294049, seqNum=51380656
Caused by: java.net.SocketTimeoutException: callTimeout=60000, callDuration=64720: row 'a899ca26-45a4-4af6-af34-72c035b4b7da/f6f88f36-9263-4dbb-9588-aaf126be9117/9223370573623181733' on table 'tt_items' at region=tt_items,a899ca26-45a4-4af6-af34-72c035b4b7da/b50993c1-7ff4-4169-b58f-f53878697709/9223370573954024736,1462900815686.08255086d13380bd559a87dd93cc15ba., hostname=hb-desktop,16201,1463231294049, seqNum=51380656
Caused by: java.io.IOException: java.io.IOException: Could not seek StoreFileScanner[HFileScanner for reader reader=hdfs://localhost:9000/hbase/data/default/tt_items/08255086d13380bd559a87dd93cc15ba/d/9bad8381ba89430daf008c28befedec3, compression=snappy, cacheConf=blockCache=LruBlockCache{blockCount=7248, currentSize=543459280, freeSize=15851184, maxSize=559310464, heapSize=543459280, minSize=531344928, minFactor=0.95, multiSize=265672464, multiFactor=0.5, singleSize=132836232, singleFactor=0.25}, cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, prefetchOnOpen=false, firstKey=a899ca26-45a4-4af6-af34-72c035b4b7da/b5099cdb-2148-4f95-866a-f6eba2813bab/9223370573623181721/d:co/1463268017037/Put, lastKey=a899ca26-45a4-4af6-af34-72c035b4b7da/fffffc52-8969-472a-a17b-65a4edd6fa66/9223370573623181744/d:r.tt.EPC/1463255430913/Put, avgKeyLen=118, avgValueLen=29, entries=20184807, length=465347831, cur=null] to key a899ca26-45a4-4af6-af34-72c035b4b7da/f6f88f36-9263-4dbb-9588-aaf126be9117/9223370573623181733/d:/LATEST_TIMESTAMP/DeleteFamily/vlen=0/seqid=0
Caused by: org.apache.hadoop.fs.ChecksumException: Checksum error: /hbase/data/default/tt_items/08255086d13380bd559a87dd93cc15ba/d/9bad8381ba89430daf008c28befedec3 at 140550656 exp: -935597690 got: -1004657115
Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(java.io.IOException): java.io.IOException: Could not seek StoreFileScanner[HFileScanner for reader reader=hdfs://localhost:9000/hbase/data/default/tt_items/08255086d13380bd559a87dd93cc15ba/d/9bad8381ba89430daf008c28befedec3, compression=snappy, cacheConf=blockCache=LruBlockCache{blockCount=7248, currentSize=543459280, freeSize=15851184, maxSize=559310464, heapSize=543459280, minSize=531344928, minFactor=0.95, multiSize=265672464, multiFactor=0.5, singleSize=132836232, singleFactor=0.25}, cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, prefetchOnOpen=false, firstKey=a899ca26-45a4-4af6-af34-72c035b4b7da/b5099cdb-2148-4f95-866a-f6eba2813bab/9223370573623181721/d:co/1463268017037/Put, lastKey=a899ca26-45a4-4af6-af34-72c035b4b7da/fffffc52-8969-472a-a17b-65a4edd6fa66/9223370573623181744/d:r.tt.EPC/1463255430913/Put, avgKeyLen=118, avgValueLen=29, entries=20184807, length=465347831, cur=null] to key a899ca26-45a4-4af6-af34-72c035b4b7da/f6f88f36-9263-4dbb-9588-aaf126be9117/9223370573623181733/d:/LATEST_TIMESTAMP/DeleteFamily/vlen=0/seqid=0

At around the same time region server says:


2016-05-16 08:30:37,390 WARN  [B.defaultRpcServer.handler=8,queue=2,port=16201] hfile.HFile: HBase checksum verification failed for file hdfs://localhost:9000/hbase/data/default/tt_items/08255086d13380bd559a87dd93cc15ba/d/9bad8381ba89430daf008c28befedec3 at offset 408984102 filesize 465347831. Retrying read with HDFS checksums turned on...
2016-05-16 08:30:37,433 WARN  [B.defaultRpcServer.handler=8,queue=2,port=16201] hdfs.DFSClient: Found Checksum error for BP-130837870-192.168.178.29-1462900512452:blk_1073746899_6086 from DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK] at 140550656
2016-05-16 08:30:37,434 INFO  [B.defaultRpcServer.handler=8,queue=2,port=16201] hdfs.DFSClient: Could not obtain BP-130837870-192.168.178.29-1462900512452:blk_1073746899_6086 from any node: java.io.IOException: No live nodes contain block BP-130837870-192.168.178.29-1462900512452:blk_1073746899_6086 after checking nodes = [DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]], ignoredNodes = null No live nodes contain current block Block locations: DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK] Dead nodes:  DatanodeInfoWithStorage[127.0.0.1:50010,DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017,DISK]. Will get new block locations from namenode and retry...
2016-05-16 08:30:37,434 WARN  [B.defaultRpcServer.handler=8,queue=2,port=16201] hdfs.DFSClient: DFS chooseDataNode: got # 1 IOException, will wait for 2146.2290363411184 msec.

.. some repetition...

2016-05-16 08:32:40,975 WARN  [B.defaultRpcServer.handler=26,queue=2,port=16201] hdfs.DFSClient: DFS chooseDataNode: got # 1 IOException, will wait for 2767.123497826818 msec.
2016-05-16 08:32:42,496 ERROR [regionserver/hb-desktop/192.168.178.29:16201-shortCompactions-1463232518856] regionserver.CompactSplitThread: Compaction failed Request = regionName=tt_items,a899ca26-45a4-4af6-af34-72c035b4b7da/b50993c1-7ff4-4169-b58f-f53878697709/9223370573954024736,1462900815686.08255086d13380bd559a87dd93cc15ba., storeName=d, fileCount=8, fileSize=1.0 G (443.8 M, 239.2 M, 124.8 M, 101.9 M, 78.5 M, 33.0 M, 32.9 M, 10.9 M), priority=2, time=152707213570012
java.io.IOException: Could not iterate StoreFileScanner[HFileScanner for reader reader=hdfs://localhost:9000/hbase/data/default/tt_items/08255086d13380bd559a87dd93cc15ba/d/9bad8381ba89430daf008c28befedec3, compression=snappy, cacheConf=blockCache=LruBlockCache{blockCount=7130, currentSize=535445256, freeSize=23865208, maxSize=559310464, heapSize=535445256, minSize=531344928, minFactor=0.95, multiSize=265672464, multiFactor=0.5, singleSize=132836232, singleFactor=0.25}, cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, prefetchOnOpen=false, firstKey=a899ca26-45a4-4af6-af34-72c035b4b7da/b5099cdb-2148-4f95-866a-f6eba2813bab/9223370573623181721/d:co/1463268017037/Put, lastKey=a899ca26-45a4-4af6-af34-72c035b4b7da/fffffc52-8969-472a-a17b-65a4edd6fa66/9223370573623181744/d:r.tt.EPC/1463255430913/Put, avgKeyLen=118, avgValueLen=29, entries=20184807, length=465347831, cur=a899ca26-45a4-4af6-af34-72c035b4b7da/f6f881d7-076a-4aec-82f0-17e33d8f4f84/9223370573623181728/d:pi/1463267434824/Put/vlen=73/seqid=0]
        at org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:146)
        at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:108)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:596)
        at org.apache.hadoop.hbase.regionserver.compactions.Compactor.performCompaction(Compactor.java:273)
        at org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:105)
        at org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:124)
        at org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1233)
        at org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:1770)
        at org.apache.hadoop.hbase.regionserver.CompactSplitThread$CompactionRunner.run(CompactSplitThread.java:520)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.fs.ChecksumException: Checksum error: /hbase/data/default/tt_items/08255086d13380bd559a87dd93cc15ba/d/9bad8381ba89430daf008c28befedec3 at 140550656 exp: -935597690 got: -1004657115
        at org.apache.hadoop.util.NativeCrc32.nativeComputeChunkedSums(Native Method)
        at org.apache.hadoop.util.NativeCrc32.verifyChunkedSums(NativeCrc32.java:59)
        at org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:301)
        at org.apache.hadoop.hdfs.RemoteBlockReader2.readNextPacket(RemoteBlockReader2.java:237)
        at org.apache.hadoop.hdfs.RemoteBlockReader2.read(RemoteBlockReader2.java:156)
        at org.apache.hadoop.hdfs.DFSInputStream$ByteArrayStrategy.doRead(DFSInputStream.java:744)
        at org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:800)
        at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:860)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:903)

Datanode says:


2016-05-16 08:26:01,750 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /127.0.0.1:46618, dest: /127.0.0.1:50010, bytes: 114196960, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_-1749190019_1, offset: 0, srvID: 7bfbedce-3f6d-4f95-8eb4-d96438fedc36, blockid: BP-130837870-192.168.178.29-1462900512452:blk_1073747594_6781, duration: 14642101541
2016-05-16 08:26:01,750 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-130837870-192.168.178.29-1462900512452:blk_1073747594_6781, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
2016-05-16 08:26:17,445 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow flushOrSync took 304ms (threshold=300ms), isSync:false, flushTotalNanos=304174121ns
2016-05-16 08:26:22,832 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-130837870-192.168.178.29-1462900512452:blk_1073747595_6782 src: /127.0.0.1:47502 dest: /127.0.0.1:50010
2016-05-16 08:26:23,385 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write data to disk cost:432ms (threshold=300ms)
2016-05-16 08:26:24,684 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /127.0.0.1:47502, dest: /127.0.0.1:50010, bytes: 11485542, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_-1749190019_1, offset: 0, srvID: 7bfbedce-3f6d-4f95-8eb4-d96438fedc36, blockid: BP-130837870-192.168.178.29-1462900512452:blk_1073747595_6782, duration: 1851995254
2016-05-16 08:26:24,685 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-130837870-192.168.178.29-1462900512452:blk_1073747595_6782, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
2016-05-16 08:26:45,373 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write data to disk cost:405ms (threshold=300ms)
2016-05-16 08:27:16,631 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-130837870-192.168.178.29-1462900512452:blk_1073747596_6783 src: /127.0.0.1:49517 dest: /127.0.0.1:50010
2016-05-16 08:27:17,729 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /127.0.0.1:49517, dest: /127.0.0.1:50010, bytes: 3455293, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_-1749190019_1, offset: 0, srvID: 7bfbedce-3f6d-4f95-8eb4-d96438fedc36, blockid: BP-130837870-192.168.178.29-1462900512452:blk_1073747596_6783, duration: 1096622112
2016-05-16 08:27:17,729 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-130837870-192.168.178.29-1462900512452:blk_1073747596_6783, type=LAST_IN_PIPELINE, downstreams=0:[] terminating

...some Slow BlockReceiver messages...

org.apache.hadoop.hdfs.server.datanode.VolumeScanner: VolumeScanner(/home/hb/seritrack-mts/nosql/data/data, DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017): no suitable block pools found to scan.  Waiting 1334756421 ms.
2016-05-16 08:30:07,254 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write data to disk cost:375ms (threshold=300ms)
2016-05-16 08:31:32,604 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService: Scheduling blk_1073747556_6743 file /home/hb/seritrack-mts/nosql/data/data/current/BP-130837870-192.168.178.29-1462900512452/current/finalized/subdir0/subdir22/blk_1073747556 for deletion

... some repetition...

2016-05-16 08:31:39,276 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-130837870-192.168.178.29-1462900512452:blk_1073747597_6784 src: /127.0.0.1:59092 dest: /127.0.0.1:50010
2016-05-16 08:31:40,380 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /127.0.0.1:59092, dest: /127.0.0.1:50010, bytes: 11480560, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_-1749190019_1, offset: 0, srvID: 7bfbedce-3f6d-4f95-8eb4-d96438fedc36, blockid: BP-130837870-192.168.178.29-1462900512452:blk_1073747597_6784, duration: 1093600168
2016-05-16 08:31:40,380 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-130837870-192.168.178.29-1462900512452:blk_1073747597_6784, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
2016-05-16 08:31:41,563 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-130837870-192.168.178.29-1462900512452:blk_1073747598_6785 src: /127.0.0.1:59147 dest: /127.0.0.1:50010
2016-05-16 08:31:44,761 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write data to disk cost:408ms (threshold=300ms)
2016-05-16 08:31:48,238 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write data to disk cost:411ms (threshold=300ms)
2016-05-16 08:31:51,398 INFO org.apache.hadoop.hdfs.server.datanode.VolumeScanner: VolumeScanner(/home/hb/seritrack-mts/nosql/data/data, DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017): no suitable block pools found to scan.  Waiting 1334619581 ms.
2016-05-16 08:31:51,529 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write data to disk cost:389ms (threshold=300ms)

...some more of those ...

2016-05-16 08:31:57,547 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /127.0.0.1:59147, dest: /127.0.0.1:50010, bytes: 268435456, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_-1749190019_1, offset: 0, srvID: 7bfbedce-3f6d-4f95-8eb4-d96438fedc36, blockid: BP-130837870-192.168.178.29-1462900512452:blk_1073747598_6785, duration: 15718340373
2016-05-16 08:31:57,547 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-130837870-192.168.178.29-1462900512452:blk_1073747598_6785, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
2016-05-16 08:31:57,749 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-130837870-192.168.178.29-1462900512452:blk_1073747599_6786 src: /127.0.0.1:59440 dest: /127.0.0.1:50010
2016-05-16 08:31:58,657 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write data to disk cost:343ms (threshold=300ms)

...some more of those ...

2016-05-16 08:32:13,559 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /127.0.0.1:59440, dest: /127.0.0.1:50010, bytes: 268435456, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_-1749190019_1, offset: 0, srvID: 7bfbedce-3f6d-4f95-8eb4-d96438fedc36, blockid: BP-130837870-192.168.178.29-1462900512452:blk_1073747599_6786, duration: 15809336101
2016-05-16 08:32:13,559 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-130837870-192.168.178.29-1462900512452:blk_1073747599_6786, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
2016-05-16 08:32:14,071 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-130837870-192.168.178.29-1462900512452:blk_1073747600_6787 src: /127.0.0.1:59678 dest: /127.0.0.1:50010
2016-05-16 08:32:16,251 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write data to disk cost:493ms (threshold=300ms)


...some more of those ...

2016-05-16 08:32:29,087 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /127.0.0.1:59678, dest: /127.0.0.1:50010, bytes: 268435456, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_-1749190019_1, offset: 0, srvID: 7bfbedce-3f6d-4f95-8eb4-d96438fedc36, blockid: BP-130837870-192.168.178.29-1462900512452:blk_1073747600_6787, duration: 15015849046
2016-05-16 08:32:29,087 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-130837870-192.168.178.29-1462900512452:blk_1073747600_6787, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
2016-05-16 08:32:29,171 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-130837870-192.168.178.29-1462900512452:blk_1073747601_6788 src: /127.0.0.1:59907 dest: /127.0.0.1:50010
2016-05-16 08:32:30,420 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write data to disk cost:322ms (threshold=300ms)

...some more of those ...

2016-05-16 08:32:42,447 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /127.0.0.1:59907, dest: /127.0.0.1:50010, bytes: 205882206, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_-1749190019_1, offset: 0, srvID: 7bfbedce-3f6d-4f95-8eb4-d96438fedc36, blockid: BP-130837870-192.168.178.29-1462900512452:blk_1073747601_6788, duration: 13275544186
2016-05-16 08:32:42,447 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-130837870-192.168.178.29-1462900512452:blk_1073747601_6788, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
2016-05-16 08:33:46,104 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Likely the client has stopped reading, disconnecting it (hb-desktop:50010:DataXceiver error processing READ_BLOCK operation  src: /127.0.0.1:46586 dst: /127.0.0.1:50010); java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/127.0.0.1:50010 remote=/127.0.0.1:46586]

Namenode says:


2016-05-16 08:30:39,662 INFO org.apache.hadoop.hdfs.StateChange: *DIR* reportBadBlocks
2016-05-16 08:30:39,662 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: duplicate requested for blk_1073746899 to add as corrupt on 127.0.0.1:50010 by /127.0.0.1  because client machine reported it
2016-05-16 08:30:41,844 INFO org.apache.hadoop.hdfs.StateChange: *DIR* reportBadBlocks
2016-05-16 08:30:41,844 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: duplicate requested for blk_1073746899 to add as corrupt on 127.0.0.1:50010 by /127.0.0.1  because client machine reported it
2016-05-16 08:30:43,495 INFO org.apache.hadoop.hdfs.StateChange: *DIR* reportBadBlocks
2016-05-16 08:30:43,495 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: duplicate requested for blk_1073746899 to add as corrupt on 127.0.0.1:50010 by /127.0.0.1  because client machine reported it
2016-05-16 08:30:47,435 INFO org.apache.hadoop.hdfs.StateChange: *DIR* reportBadBlocks
2016-05-16 08:30:47,436 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: duplicate requested for blk_1073746899 to add as corrupt on 127.0.0.1:50010 by /127.0.0.1  because client machine reported it
2016-05-16 08:30:51,842 INFO org.apache.hadoop.hdfs.StateChange: *DIR* reportBadBlocks
2016-05-16 08:30:51,842 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: duplicate requested for blk_1073746899 to add as corrupt on 127.0.0.1:50010 by /127.0.0.1  because client machine reported it
2016-05-16 08:30:57,476 INFO org.apache.hadoop.hdfs.StateChange: *DIR* reportBadBlocks
2016-05-16 08:30:57,476 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: duplicate requested for blk_1073746899 to add as corrupt on 127.0.0.1:50010 by /127.0.0.1  because client machine reported it
2016-05-16 08:31:06,539 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2016-05-16 08:31:06,539 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
2016-05-16 08:31:08,643 INFO org.apache.hadoop.hdfs.StateChange: *DIR* reportBadBlocks
2016-05-16 08:31:08,643 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: duplicate requested for blk_1073746899 to add as corrupt on 127.0.0.1:50010 by /127.0.0.1  because client machine reported it
2016-05-16 08:31:21,129 INFO org.apache.hadoop.hdfs.StateChange: *DIR* reportBadBlocks
2016-05-16 08:31:21,130 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: duplicate requested for blk_1073746899 to add as corrupt on 127.0.0.1:50010 by /127.0.0.1  because client machine reported it
2016-05-16 08:31:28,993 INFO BlockStateChange: BLOCK* addToInvalidates: blk_1073747591_6778 127.0.0.1:50010
2016-05-16 08:31:28,993 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 60 Total time for transactions(ms): 0 Number of transactions batched in Syncs: 0 Number of syncs: 36 SyncTimes(ms): 11435
2016-05-16 08:31:29,186 INFO BlockStateChange: BLOCK* addToInvalidates: blk_1073747556_6743 127.0.0.1:50010
2016-05-16 08:31:29,321 INFO BlockStateChange: BLOCK* addToInvalidates: blk_1073747574_6761 127.0.0.1:50010
2016-05-16 08:31:29,674 INFO BlockStateChange: BLOCK* addToInvalidates: blk_1073747593_6780 127.0.0.1:50010
2016-05-16 08:31:29,713 INFO BlockStateChange: BLOCK* addToInvalidates: blk_1073747558_6745 127.0.0.1:50010
2016-05-16 08:31:29,796 INFO BlockStateChange: BLOCK* addToInvalidates: blk_1073747575_6762 127.0.0.1:50010
2016-05-16 08:31:30,237 INFO BlockStateChange: BLOCK* BlockManager: ask 127.0.0.1:50010 to delete [blk_1073747556_6743, blk_1073747558_6745, blk_1073747574_6761, blk_1073747575_6762, blk_1073747591_6778, blk_1073747593_6780]
2016-05-16 08:31:32,007 INFO org.apache.hadoop.hdfs.StateChange: *DIR* reportBadBlocks
2016-05-16 08:31:32,007 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: duplicate requested for blk_1073746899 to add as corrupt on 127.0.0.1:50010 by /127.0.0.1  because client machine reported it
2016-05-16 08:31:36,540 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2016-05-16 08:31:36,540 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
2016-05-16 08:31:38,849 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: /hbase/data/default/tt_items/08255086d13380bd559a87dd93cc15ba/.tmp/9652e091531943848ee523a60bc5baa5. BP-130837870-192.168.178.29-1462900512452 blk_1073747597_6784{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017:NORMAL:127.0.0.1:50010|RBW]]}
2016-05-16 08:31:40,381 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 127.0.0.1:50010 is added to blk_1073747597_6784{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017:NORMAL:127.0.0.1:50010|RBW]]} size 0
2016-05-16 08:31:40,745 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /hbase/data/default/tt_items/08255086d13380bd559a87dd93cc15ba/.tmp/9652e091531943848ee523a60bc5baa5 is closed by DFSClient_NONMAPREDUCE_-1749190019_1
2016-05-16 08:31:41,294 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: /hbase/data/default/tt_items/08255086d13380bd559a87dd93cc15ba/.tmp/fb79c8ab90a5498089841431191c03ca. BP-130837870-192.168.178.29-1462900512452 blk_1073747598_6785{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017:NORMAL:127.0.0.1:50010|RBW]]}
2016-05-16 08:31:57,547 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 127.0.0.1:50010 is added to blk_1073747598_6785{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017:NORMAL:127.0.0.1:50010|RBW]]} size 0
2016-05-16 08:31:57,551 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: /hbase/data/default/tt_items/08255086d13380bd559a87dd93cc15ba/.tmp/fb79c8ab90a5498089841431191c03ca. BP-130837870-192.168.178.29-1462900512452 blk_1073747599_6786{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017:NORMAL:127.0.0.1:50010|RBW]]}
2016-05-16 08:32:06,539 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2016-05-16 08:32:06,539 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
2016-05-16 08:32:13,559 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 127.0.0.1:50010 is added to blk_1073747599_6786{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017:NORMAL:127.0.0.1:50010|RBW]]} size 0
2016-05-16 08:32:13,875 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: /hbase/data/default/tt_items/08255086d13380bd559a87dd93cc15ba/.tmp/fb79c8ab90a5498089841431191c03ca. BP-130837870-192.168.178.29-1462900512452 blk_1073747600_6787{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017:NORMAL:127.0.0.1:50010|RBW]]}
2016-05-16 08:32:29,087 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 127.0.0.1:50010 is added to blk_1073747600_6787{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017:NORMAL:127.0.0.1:50010|RBW]]} size 0
2016-05-16 08:32:29,088 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: /hbase/data/default/tt_items/08255086d13380bd559a87dd93cc15ba/.tmp/fb79c8ab90a5498089841431191c03ca. BP-130837870-192.168.178.29-1462900512452 blk_1073747601_6788{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017:NORMAL:127.0.0.1:50010|RBW]]}
2016-05-16 08:32:29,088 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 91 Total time for transactions(ms): 0 Number of transactions batched in Syncs: 0 Number of syncs: 57 SyncTimes(ms): 14475
2016-05-16 08:32:36,539 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2016-05-16 08:32:36,539 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
2016-05-16 08:32:42,043 INFO org.apache.hadoop.hdfs.StateChange: *DIR* reportBadBlocks
2016-05-16 08:32:42,044 INFO BlockStateChange: BLOCK NameSystem.addToCorruptReplicasMap: duplicate requested for blk_1073746899 to add as corrupt on 127.0.0.1:50010 by /127.0.0.1  because client machine reported it
2016-05-16 08:32:42,447 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 127.0.0.1:50010 is added to blk_1073747601_6788{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-9cc4b81b-dbe3-4da1-a394-9ca30db55017:NORMAL:127.0.0.1:50010|RBW]]} size 0
2016-05-16 08:32:42,495 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /hbase/data/default/tt_items/08255086d13380bd559a87dd93cc15ba/.tmp/fb79c8ab90a5498089841431191c03ca is closed by DFSClient_NONMAPREDUCE_-1749190019_1

Upvotes: 1

Views: 2670

Answers (1)

gnomie
gnomie

Reputation: 439

As there was no further response as it seems that HDFS-8809 is indeed the source of the problem - that is more of a confusion than a real problem, I am suggesting that as an answer.

Upvotes: 1

Related Questions