Reputation: 11
I am running an ETL job with Hadoop where I need to output the valid, transformed data to HBase, plus an external index for that data into MySQL. My initial thought is that I could use MultipleOutputs to export the transformed data with HFileOutputFormat (key is Text and value is ProtobufWritable), and the index with TextOutputFormat (key is Text and value is Text).
The number of input records for an average-sized job is about 700 million, and I'll need the ability to run many jobs at once.
I'm wondering A) whether this seems like a reasonable approach in terms of efficiency and complexity, and B) how to accomplish it with the CDH3 distribution's API, if that's possible.
Upvotes: 1
Views: 717
Reputation: 2181
If you're using the old MapReduce API, then you can use MultipleOutputs to write to multiple output formats.
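As a rough sketch of what that looks like with the old (`org.apache.hadoop.mapred`) API, which is what CDH3 ships: you register each named output with its own OutputFormat in the driver, then fetch a collector by name in the reducer. The channel names `"data"` and `"index"`, the class names, and the use of TextOutputFormat for both channels are illustrative only (in your case the data channel would be HFileOutputFormat with a ProtobufWritable value); this isn't runnable without the Hadoop jars on the classpath.

```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

public class EtlReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  private MultipleOutputs mos;

  // In the driver, register one named output per format:
  //   MultipleOutputs.addNamedOutput(conf, "data",
  //       TextOutputFormat.class, Text.class, Text.class);
  //   MultipleOutputs.addNamedOutput(conf, "index",
  //       TextOutputFormat.class, Text.class, Text.class);

  @Override
  public void configure(JobConf conf) {
    mos = new MultipleOutputs(conf);
  }

  @Override
  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    while (values.hasNext()) {
      Text value = values.next();
      // Route each record to the appropriate named output by name.
      mos.getCollector("data", reporter).collect(key, value);
      mos.getCollector("index", reporter).collect(key, value);
    }
  }

  @Override
  public void close() throws IOException {
    // Must close MultipleOutputs, or the side files may be incomplete.
    mos.close();
  }
}
```

Each named output gets its own files alongside the job's regular output, so the two formats never collide.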
However, if you're using the new MapReduce API, I'm not sure there's a built-in way to do what you're trying to do. You might have to pay the price of running another MapReduce job over the same inputs, but I'd have to do more research before saying that for sure. There might also be a way to hack the old and new APIs together to let you use MultipleOutputs with the new API.
EDIT: Have a look at this post. You can probably implement your own OutputFormat that wraps the appropriate RecordWriters and use it to write to multiple output formats.
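The wrapping idea could be sketched like this for the new (`org.apache.hadoop.mapreduce`) API. The class name `DualOutputFormat`, the choice of TextOutputFormat for both delegates, and the routing logic are all assumptions for illustration; in a real implementation each delegate must be pointed at a distinct output path (two TextOutputFormats writing to the same work file will collide), and one of them would be HFileOutputFormat. Not runnable without the Hadoop jars.

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

/** Hypothetical OutputFormat that fans records out to two delegates. */
public class DualOutputFormat extends OutputFormat<Text, Text> {

  private final TextOutputFormat<Text, Text> dataFmt =
      new TextOutputFormat<Text, Text>();
  private final TextOutputFormat<Text, Text> indexFmt =
      new TextOutputFormat<Text, Text>();

  @Override
  public RecordWriter<Text, Text> getRecordWriter(TaskAttemptContext ctx)
      throws IOException, InterruptedException {
    // NOTE: in practice the two delegates need distinct output paths;
    // this sketch glosses over that.
    final RecordWriter<Text, Text> data = dataFmt.getRecordWriter(ctx);
    final RecordWriter<Text, Text> index = indexFmt.getRecordWriter(ctx);

    return new RecordWriter<Text, Text>() {
      @Override
      public void write(Text key, Text value)
          throws IOException, InterruptedException {
        // Route here however you like: both writers, or one per record.
        data.write(key, value);
        index.write(key, value);
      }

      @Override
      public void close(TaskAttemptContext c)
          throws IOException, InterruptedException {
        data.close(c);
        index.close(c);
      }
    };
  }

  @Override
  public void checkOutputSpecs(JobContext ctx)
      throws IOException, InterruptedException {
    dataFmt.checkOutputSpecs(ctx);
  }

  @Override
  public OutputCommitter getOutputCommitter(TaskAttemptContext ctx)
      throws IOException, InterruptedException {
    // Delegating to a single committer is a simplification; committing
    // two formats correctly takes more care than shown here.
    return dataFmt.getOutputCommitter(ctx);
  }
}
```

You'd then set it on the job with `job.setOutputFormatClass(DualOutputFormat.class)`.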
Upvotes: 1