Reputation: 25
I am converting some custom files that I have into Hadoop SequenceFiles using the Java API.
I am reading byte arrays from a local file and appending them to a sequence file as pairs of Index (Integer) - Data (Byte[]):
InputStream in = new BufferedInputStream(new FileInputStream(localSource));
FileSystem fs = FileSystem.get(URI.create(hDFSDestinationDirectory), conf);
Path sequenceFilePath = new Path(hDFSDestinationDirectory + "/" + "data.seq");

IntWritable key = new IntWritable();
BytesWritable value = new BytesWritable();

SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
        sequenceFilePath, key.getClass(), value.getClass());

for (int i = 1; i <= nz; i++) {
    byte[] imageData = new byte[nx * ny * 2];
    in.read(imageData);
    key.set(i);
    value.set(imageData, 0, imageData.length);
    writer.append(key, value);
}

IOUtils.closeStream(writer);
in.close();
I do exactly the opposite when I want to bring the files back to the initial format:
for (int i = 1; i <= nz; i++) {
    reader.next(key, value);
    int byteLength = value.getLength();
    byte[] tempValue = value.getBytes();
    out.write(tempValue, 0, byteLength);
    out.flush();
}
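(reader and out are set up earlier along these lines, using the same fs, conf and sequenceFilePath as above; localDestination is just a placeholder for the local target file:)

SequenceFile.Reader reader = new SequenceFile.Reader(fs, sequenceFilePath, conf);
IntWritable key = new IntWritable();
BytesWritable value = new BytesWritable();
OutputStream out = new BufferedOutputStream(new FileOutputStream(localDestination));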
I noticed that writing to the SequenceFile takes almost an order of magnitude longer than reading it back. I expect writing to be slower than reading, but is a difference this large normal? Why?
More Info:
Each byte array I read is 2 MB in size (nx = ny = 1024 and nz = 128).
I am testing in pseudo-distributed mode.
Upvotes: 0
Views: 3605
Reputation: 30089
Are nx and ny constants?
One reason you could be seeing this is that each iteration of your for loop creates a new byte array. This requires the JVM to allocate you some heap space. If the array is sufficiently large, this is going to be expensive, and eventually you're going to run into the GC. I'm not too sure what HotSpot might do to optimize this out, however.
My suggestion would be to create a single BytesWritable:
// use DataInputStream so you can call readFully()
DataInputStream in = new DataInputStream(new FileInputStream(localSource));
FileSystem fs = FileSystem.get(URI.create(hDFSDestinationDirectory), conf);
Path sequenceFilePath = new Path(hDFSDestinationDirectory + "/" + "data.seq");

IntWritable key = new IntWritable();
// create a BytesWritable which can hold the maximum possible number of bytes
BytesWritable value = new BytesWritable(new byte[maxPossibleSize]);
// grab a reference to the value's underlying byte array
byte[] byteBuf = value.getBytes();

SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
        sequenceFilePath, key.getClass(), value.getClass());

for (int i = 1; i <= nz; i++) {
    // work out how many bytes to read - if this is a constant, move it outside the for loop
    int imageDataSize = nx * ny * 2;
    // read the bytes straight into the reused buffer
    in.readFully(byteBuf, 0, imageDataSize);
    key.set(i);
    // set the actual number of bytes used in the BytesWritable object
    value.setSize(imageDataSize);
    writer.append(key, value);
}

IOUtils.closeStream(writer);
in.close();
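If you're on a more recent Hadoop release, note that the createWriter(fs, conf, ...) overload used above is deprecated; the equivalent writer can be built from Writer.Option arguments, roughly like this (same key/value classes assumed):

SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(sequenceFilePath),
        SequenceFile.Writer.keyClass(IntWritable.class),
        SequenceFile.Writer.valueClass(BytesWritable.class));

This doesn't change the performance argument above; it's just the non-deprecated way of obtaining the same writer.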
Upvotes: 1
Reputation: 782
You are reading from local disk and writing to HDFS. When you write to HDFS, your data is probably being replicated, so it is physically written two or three times, depending on what you have set for the replication factor.
So you are not only writing, you are writing two or three times the amount of data you are reading. And your writes go over the network; your reads do not.
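You can confirm this by checking what replication factor the file actually ended up with, and optionally rerunning the write with a single replica for a like-for-like comparison. A rough sketch, reusing the fs, conf and sequenceFilePath from the question:

// how many replicas does data.seq actually have?
short replication = fs.getFileStatus(sequenceFilePath).getReplication();
System.out.println("replication factor: " + replication);

// for a single-replica test run, set this before calling FileSystem.get(...)
conf.setInt("dfs.replication", 1);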
Upvotes: 1