Reputation: 273
Despite all the resources on this subject, I have trouble flushing my HDFS files to disk (Hadoop 2.6). Calling FSDataOutputStream.hsync() should do the trick, but it actually only works once, for unknown reasons...
Here is a simple unit test that fails:
@Test
public void test() throws InterruptedException, IOException {
    final FileSystem filesys = HdfsTools.getFileSystem();
    final Path file = new Path("myHdfsFile");
    try (final FSDataOutputStream stream = filesys.create(file)) {
        Assert.assertEquals(0, getSize(filesys, file));

        stream.writeBytes("0123456789");
        stream.hsync();
        stream.hflush();
        stream.flush();
        Thread.sleep(100);
        Assert.assertEquals(10, getSize(filesys, file)); // Works

        stream.writeBytes("0123456789");
        stream.hsync();
        stream.hflush();
        stream.flush();
        Thread.sleep(100);
        Assert.assertEquals(20, getSize(filesys, file)); // Fails, still 10
    }
    Assert.assertEquals(20, getSize(filesys, file)); // Works once the stream is closed
}
private long getSize(FileSystem filesys, Path file) throws IOException {
    return filesys.getFileStatus(file).getLen();
}
Any idea why?
Upvotes: 3
Views: 1857
Reputation: 273
In fact, hsync() internally calls the private flushOrSync(boolean isSync, EnumSet<SyncFlag> syncFlags) with no flags set, and the length is only updated on the namenode if SyncFlag.UPDATE_LENGTH is provided.
In the above test, replacing getSize() with code that actually reads the file works:
private long getSize(FileSystem filesys, Path file) throws IOException {
    long length = 0;
    try (final FSDataInputStream input = filesys.open(file)) {
        while (input.read() >= 0) {
            length++;
        }
    }
    return length;
}
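Note that reading byte by byte gets slow for larger files; here is a buffered sketch of the same idea (the 8 KB buffer size is an arbitrary choice of mine, not anything mandated by HDFS):

private long getSize(FileSystem filesys, Path file) throws IOException {
    final byte[] buffer = new byte[8192];
    long length = 0;
    try (final FSDataInputStream input = filesys.open(file)) {
        int read;
        // Reading through the stream sees the hsync'ed bytes that
        // getFileStatus() does not yet report.
        while ((read = input.read(buffer)) >= 0) {
            length += read;
        }
    }
    return length;
}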
To update the size on the namenode, you can alternatively call (without proper class type checking):
((DFSOutputStream) stream.getWrappedStream()).hsync(EnumSet.of(SyncFlag.UPDATE_LENGTH));
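For completeness, a sketch of the same call with the type check in place. hsyncWithLength is a hypothetical helper name of mine, and the imports assume the Hadoop 2.x package layout:

import java.io.IOException;
import java.io.OutputStream;
import java.util.EnumSet;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.hdfs.DFSOutputStream;
import org.apache.hadoop.hdfs.client.HdfsDataOutputStream.SyncFlag;

// Hypothetical helper: hsync and ask the namenode to refresh the file length.
private static void hsyncWithLength(FSDataOutputStream stream) throws IOException {
    final OutputStream wrapped = stream.getWrappedStream();
    if (wrapped instanceof DFSOutputStream) {
        // The flagged overload only exists on DFSOutputStream;
        // UPDATE_LENGTH makes getFileStatus() see the new length.
        ((DFSOutputStream) wrapped).hsync(EnumSet.of(SyncFlag.UPDATE_LENGTH));
    } else {
        // Not an HDFS stream (e.g. a local filesystem in tests): plain hsync().
        stream.hsync();
    }
}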
Upvotes: 4