Reputation: 6215
So I'm trying to understand some behavior in HDFS. My goal is to set up a configuration where I open an FSDataOutputStream to some location and then I have some other part of my application open up an FSDataInputStream to that same location immediately, before I write any bytes.
The idea would be that as I write bytes to the FSDataOutputStream, flush them, and call 'sync()', anyone with access to an FSDataInputStream to that same location should be able to read those bytes.
Sadly, it doesn't seem to work that way. When I set up my code that way, this happens:
FSDataOutputStream writer = fs.create(new Path("/foo/bar"));
FSDataInputStream reader = fs.open(new Path("/foo/bar"));
writer.write(new byte[]{1, 1, 1, 1, 1});
writer.flush();
writer.sync();
System.out.println(reader.available()); // writes '0'
However! When I set up my code this way, this happens:
FSDataOutputStream writer = fs.create(new Path("/foo/bar"));
writer.write(new byte[] {1, 1, 1, 1, 1});
writer.flush();
writer.sync();
FSDataInputStream reader = fs.open(new Path("/foo/bar"));
System.out.println(reader.available()); // writes '5'
Finally, the third test I ran was this:
FSDataOutputStream writer = fs.create(new Path("/foo/bar"));
writer.write(new byte[] {1, 1, 1, 1, 1});
writer.flush();
writer.sync();
FSDataInputStream reader = fs.open(new Path("/foo/bar"));
writer.write(new byte[] {2, 2, 2, 2, 2});
writer.flush();
writer.sync();
System.out.println(reader.available()); // writes '5'
My takeaway is that an FSDataInputStream is always going to be limited to the bytes that had already been written when the input stream was created. Is there any way around this? I don't see a 'refresh()' method on the input stream or anything like that.
I would really, really like it if there was some way for me to force an input stream to update its available bytes. What am I missing? What am I doing wrong? Is this simply the wrong way to do stuff like this?
Upvotes: 2
Views: 871
Reputation: 10941
As far as I can tell, DFSInputStream only refreshes its list of located blocks on open and when it encounters an error trying to read from a block. So regardless of what you do on the output stream, an already-open input stream won't see the new bytes.
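One workaround, given that behavior, is to re-open the input stream whenever you want to pick up bytes that were flushed and synced after the previous open. This is only a sketch (it assumes a Hadoop classpath and a reachable FileSystem, and the class/field names here are illustrative, not from the question):

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: instead of trying to "refresh" one long-lived
// FSDataInputStream, open a fresh stream each time and seek past the
// bytes already consumed.
class TailingReader {
    private final FileSystem fs;
    private final Path path;
    private long pos = 0; // number of bytes consumed so far

    TailingReader(FileSystem fs, Path path) {
        this.fs = fs;
        this.path = path;
    }

    /** Open a new stream positioned just past what we've already read. */
    FSDataInputStream refresh() throws IOException {
        FSDataInputStream in = fs.open(path);
        in.seek(pos); // a fresh open sees everything synced so far
        return in;
    }

    /** Record that n more bytes have been read from the last stream. */
    void consumed(long n) {
        pos += n;
    }
}
```

The cost is one NameNode round trip per re-open, so in a tight polling loop you would want to rate-limit how often `refresh()` is called.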
If you are trying to implement a single-producer/multiple-consumer system, you might look into using something like ZooKeeper for coordination.
Upvotes: 1