Reputation: 330
I would like to know if anyone has experience storing large files on DFS and reading them back. For example, I want to store thousands of records that all describe a single kind of object but are different instances of it. The object would be described by a class like the following:
class someclass {
    attr1
    attr2
    attr3
    ....
}
The class stays the same, but I would have many different instances of it. Which is better for use in Hadoop: binary storage (writing a serializer and dumping the objects), or ASCII that I parse on demand?
I should also mention that the number of attributes might change a bit in the future. If possible, I'd like to avoid updating the class instances already written to the DFS.
Upvotes: 0
Views: 335
Reputation: 2345
Use Avro binary serialization. You can't reuse your existing class directly, but the Avro record will look the same in terms of attributes and types. Avro has very flexible schema support, its files are splittable, and it is fully supported by Hadoop out of the box.
Your class' schema will look like this:
{"namespace": "your.package.name",
"type": "record",
"name": "SomeClass",
"fields": [
{"name": "attr1", "type": "YourType1"},
{"name": "attr2", "type": "YourType2"},
{"name": "attr3", "type": "YourType3"}
]
}
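As a minimal sketch of how this could be used (assuming the schema above is saved as someclass.avsc, that attr1..attr3 are strings, and that the file names and the AvroExample class are my own placeholders), you can write and read instances as GenericRecords without generating any class at all:

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroExample {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(new File("someclass.avsc"));

        // Write: each appended record is one instance of SomeClass.
        File out = new File("someclass.avro");
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, out);
            GenericRecord rec = new GenericData.Record(schema);
            rec.put("attr1", "value1");
            rec.put("attr2", "value2");
            rec.put("attr3", "value3");
            writer.append(rec);
        }

        // Read: the container file carries the writer's schema with it.
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(out, new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord r : reader) {
                System.out.println(r);
            }
        }
    }
}

Because every Avro container file embeds the schema it was written with, you can later add new attributes (with default values) to the reader schema and still read the old files without rewriting them, which addresses your concern about the attribute list changing.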
Upvotes: 1