Reputation: 330
I would like to know if anyone has experience storing large files on DFS and reading them back. For example, I want to store thousands of records that all describe a single kind of object but are different instances of it. The object would be described by a class like the following:
class someclass {
    attr1
    attr2
    attr3
    ....
}
The class stays the same, but I would have many different instances of it. Which is better for use in Hadoop: binary storage (writing a serializer and dumping the objects), or ASCII that I parse on demand?
I should also mention that the number of attributes might change a bit in the future. If possible, I'd like to avoid updating the class instances already written to the DFS.
Upvotes: 0
Views: 335
Reputation: 2345
Use Avro binary serialization. You can't reuse your existing class directly, but the Avro record will look the same in terms of attributes and types. Avro has very flexible schema support, its files are splittable, and it is fully supported by Hadoop out of the box.
Your class' schema will look like this:
{"namespace": "your.package.name",
"type": "record",
"name": "SomeClass",
"fields": [
{"name": "attr1", "type": "YourType1"},
{"name": "attr2", "type": "YourType2"},
{"name": "attr3", "type": "YourType3"}
]
}
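As a minimal sketch of how this could be used (assuming the schema above is saved as someclass.avsc, that attr1..attr3 are strings, and that the file names and the AvroExample class are my own placeholders), you can write and read instances as GenericRecords without generating any class at all:

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroExample {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(new File("someclass.avsc"));

        // Write: each appended record is one instance of SomeClass.
        File out = new File("someclass.avro");
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, out);
            GenericRecord rec = new GenericData.Record(schema);
            rec.put("attr1", "value1");
            rec.put("attr2", "value2");
            rec.put("attr3", "value3");
            writer.append(rec);
        }

        // Read: the container file carries the writer's schema with it.
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(out, new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord r : reader) {
                System.out.println(r);
            }
        }
    }
}

Because every Avro container file embeds the schema it was written with, you can later add new attributes (with default values) to the reader schema and still read the old files without rewriting them, which addresses your concern about the attribute list changing.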
Upvotes: 1