Reputation: 131
I have more of a design question regarding the necessity of a CustomWritable for my use case:
So I have a document pair that I will process through a pipeline and write out intermediate and final data to HDFS. My key will be something like ObjectId - DocId - Pair - Lang. I do not see why/if I will need a CustomWritable for this use case. I guess if I did not have a key, I would need a CustomWritable? Also, when I write data out to HDFS in the Reducer, I use a Custom Partitioner. So, that would kind of eliminate my need for a Custom Writable?
I am not sure if I got the concept of the need for a Custom Writable right. Can someone point me in the right direction?
Upvotes: 1
Views: 557
Reputation: 1027
Thomas' answer explains a bit. Its way too late but I'd like to add the following for prospective readers:
Partitioner only comes into play between the map and reduce phase and has no role to play in writing from reducer to output files.
I don't believe writing INTERMEDIATE data to hdfs is a requirement in most cases, although there are some hacks that can be applied to do the same.
When you write from a reducer to hdfs, the keys will automatically be sorted and each reducer will write to ONE SEPARATE file. Based on their compareTo
method, keys are sorted. So if you want to sort based on multiple variables, go for a Custom key class that extends WritableComparable
, and implement the write
, readFields
and compareTo
methods. You can now control the way the keys are sorted, based on the compareTo
implementation
Upvotes: 0
Reputation: 20969
Writables can be used for de/serializing objects. For example a log entry can contain a timestamp, an user IP and the browser agent. So you should implement your own WritableComparable for a key that identifies this entry and you should implement a value class that implements Writable that reads and writes the attributes in your log entry.
These serializations are just a handy way to get the data from a binary format to an object. Some Frameworks like HBase still require byte arrays to persist the data. So you'll have a lot of overhead in transfering this by yourself and messes up your code.
Upvotes: 1