Reputation: 3303
Why does the Hadoop framework, which implements the MapReduce distributed programming paradigm, use a Text class to store a string when Java already provides String? It seems unnecessarily redundant (lol).
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/Text.html
Upvotes: 1
Views: 1635
Reputation: 34184
Redundant???
Let me shed some light. In distributed systems, efficient serialization and deserialization play a vital role. They appear in two quite distinct areas of distributed data processing: interprocess communication and persistent storage.
To be specific to Hadoop, IPC between nodes is implemented using RPCs. The RPC protocol uses serialization to render the message into a binary stream to be sent to the remote node, which then deserializes the binary stream into the original message. So it is very important to have a solid serialization/deserialization framework in order to store and process huge amounts of data efficiently. In general, it is desirable that an RPC serialization format is compact, fast, extensible, and interoperable.
Hadoop uses its own types because developers wanted the storage format to be compact (to make efficient use of storage space), fast (so the overhead in reading or writing terabytes of data is minimal), extensible (so we can transparently read data written in an older format), and interoperable (so we can read or write persistent data using different languages).
A few points to remember before concluding that dedicated MapReduce types are redundant:
HTH
Upvotes: 1
Reputation: 644
Why can't I use the basic String or Integer classes?
Integer and String implement Java's standard Serializable interface. The problem is that MapReduce does not use this standard interface to serialize and deserialize values; it uses its own interface, called Writable.
The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
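To make the quoted requirement concrete, here is a minimal sketch of what such a key type looks like. It uses only java.io so it runs without Hadoop on the classpath: SketchWritable and SketchWritableComparable are hypothetical stand-ins that mirror the two methods of Hadoop's Writable (write and readFields) plus Comparable, and IntKey is an illustrative class, not Hadoop's IntWritable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class WritableSketch {
    // Hypothetical stand-in for org.apache.hadoop.io.Writable (same two methods).
    interface SketchWritable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    // Hypothetical stand-in for WritableComparable: serializable and sortable.
    interface SketchWritableComparable<T> extends SketchWritable, Comparable<T> {}

    // Illustrative int key; plays a role similar to IntWritable.
    static final class IntKey implements SketchWritableComparable<IntKey> {
        private int value;

        IntKey() {}                       // empty instance, to be filled by readFields
        IntKey(int value) { this.value = value; }

        int get() { return value; }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(value);          // 4 bytes, no class metadata
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            value = in.readInt();         // repopulates this instance in place
        }

        @Override
        public int compareTo(IntKey other) {
            return Integer.compare(value, other.value);
        }
    }

    // Serializes any SketchWritable into a byte array.
    static byte[] serialize(SketchWritable w) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            w.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Fills an existing instance from previously serialized bytes.
    static void deserialize(SketchWritable target, byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            target.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] wire = serialize(new IntKey(42));
        IntKey copy = new IntKey();
        deserialize(copy, wire);
        System.out.println(wire.length + " bytes, value " + copy.get()); // 4 bytes, value 42
    }
}
```

In a real job you would implement Hadoop's own Writable or WritableComparable instead of these stand-ins; the shape of the two methods is the same.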
Here is the link to MapReduce Tutorial
Upvotes: 0
Reputation: 900
Hadoop implements its own classes: Text for String, LongWritable for Long, and IntWritable for Integer.
The purpose of these classes is to define Hadoop's own basic types, optimized for network serialization. They are found in the org.apache.hadoop.io package.
These types produce a compact serialized form that makes the best use of network bandwidth. Hadoop is meant to process big data, so network bandwidth is the most precious resource and has to be used very effectively. In addition, these classes reduce the overhead of serializing and deserializing objects compared to Java's native types.
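A rough way to see the size difference is to compare Java's built-in object serialization with a bare DataOutput-style encoding. The sketch below uses only java.io; the class and method names are made up for illustration, and the length-prefixed string layout is an assumption chosen for simplicity, not Text's exact wire format.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class SerializedSizeSketch {
    // Bytes produced by Java's built-in serialization (stream header and class metadata included).
    static int javaSerializedSize(Object value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    // Bytes produced by writing only the raw int value.
    static int rawIntSize(int value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    // Bytes produced by a 4-byte length prefix followed by the UTF-8 bytes
    // (illustrative layout only).
    static int rawStringSize(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(utf8.length);
            out.write(utf8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        System.out.println("Integer via Java serialization: " + javaSerializedSize(42) + " bytes");
        System.out.println("int via raw encoding:           " + rawIntSize(42) + " bytes");
        System.out.println("String via Java serialization:  " + javaSerializedSize("hadoop") + " bytes");
        System.out.println("String via raw encoding:        " + rawStringSize("hadoop") + " bytes");
    }
}
```

The gap is largest for boxed values such as Integer, where Java serialization writes class metadata alongside the value; multiplied over billions of records shuffled between nodes, that overhead adds up.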
Upvotes: 4