Reputation: 2255
I want to ingest large CSV files (up to 6 GB) on a regular basis into a single-node Hadoop setup with 32 GB RAM. The key requirement is to register the data in HCatalog. (Please do not discuss the requirements; it is a functional demo.) Performance is not essential. The Hive tables shall be partitioned.
So far I have been using Pig. The lesson learned so far is that the main challenge is the heap: the generated MapReduce jobs fill it up quickly, and once the JVM spends around 98% of its time garbage collecting, it fails with a GC overhead limit error.
One solution might be to chunk the large files into smaller pieces. However, I am also considering that a technology other than Pig might not fill up the heap as much. Any ideas on how to approach such a use case? Thanks.
Upvotes: 1
Views: 193
Reputation: 2255
The best approach for this is to use HiveQL's LOAD DATA statement instead of Pig's LOAD. It is just a file transfer into the table's HDFS location; no MapReduce jobs are launched.
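As a minimal sketch (the table name, columns, and paths are made up for illustration), the flow is: create a partitioned Hive table, then load each CSV file into its partition. Because the table lives in the Hive metastore, it is automatically visible through HCatalog, which covers the registration requirement.

    -- Hypothetical partitioned table; adjust columns and delimiter to the actual CSV layout.
    CREATE TABLE IF NOT EXISTS demo_events (
      id      BIGINT,
      payload STRING
    )
    PARTITIONED BY (load_date STRING)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

    -- Moves the file from its HDFS staging path into the partition directory.
    -- This is a metadata plus filesystem operation; no MapReduce job is started,
    -- so the 6 GB file never has to fit through a mapper's heap.
    LOAD DATA INPATH '/staging/events_2014-01-01.csv'
    INTO TABLE demo_events
    PARTITION (load_date = '2014-01-01');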
Upvotes: 1