Reputation: 955
Hi, I am trying to set up a Hadoop environment. In short, the problem I am trying to solve involves billions of XML files of a few MB each; I want to extract relevant information from them using Hive and do some analytic work with that information. I know this is a trivial problem in the Hadoop world, but if the Hadoop solution works well for me, the size and number of files I will be dealing with will grow geometrically.
I did research by referring to various books like "Hadoop: The Definitive Guide" and "Hadoop in Action", as well as resources such as documents by Yahoo and Hortonworks. I am still not able to figure out the hardware/software specifications for establishing the Hadoop environment. In the resources I have referred to so far I have only found fairly generic, standard solutions,
but if anyone can give more specific suggestions, that would be great. Thanks.
Upvotes: 2
Views: 4760
Reputation: 401
"extract relevant information from them using HIVE" That is going to be a bit tricky since hive doesn't really do well with xml files. you are going to want to build a parsing script in another language (ruby, python, perl, etc) that can parse the xml files and produce columnar output that you will load into hive. You can then use hive to call that external parsing script with a transform, or just use hadoopstreaming to prepare the data for hive. Then it is just a matter of how fast you need the work done and how much space you need to hold the amount of data you are going to have.
You could build the process with a handful of files on a single machine to test it, but you really need a better handle on your overall planned workload to size your cluster properly. A minimum production cluster is 3 or 4 machines, just for data redundancy. Beyond that, add nodes as necessary to meet your workload.
Upvotes: 0
Reputation: 8088
First I would suggest you consider what you need more of: processing with some storage, or the opposite, and select hardware from that point of view. Your case sounds more processing-heavy than storage-heavy.
I would spec the standard Hadoop hardware a bit differently:
NameNode: high-quality disks in a mirror, 16 GB of RAM.
Data nodes: 16-24 GB RAM, dual quad-core or dual six-core CPUs, 4 to 6 SATA drives of 1-3 TB each.
I would also consider the 10 GBit networking option. I think it makes sense if it does not add more than 15% to the cluster price; that 15% comes from a rough estimate that shipping data from mappers to reducers takes about 15% of job time.
In your case I would be more willing to sacrifice disk capacity to save money, but not CPU, memory, or the number of drives.
Upvotes: 1