Reputation: 2255
I want to ingest large CSV files (up to 6 GB) on a regular basis into a single-node Hadoop setup with 32 GB RAM. The key requirement is to register the data in HCatalog. (Please do not discuss the requirements; it is a functional demo.) Performance is not essential. The Hive tables shall be partitioned.
So far I have been using Pig. The lesson learned so far is that the main challenge is the heap: the generated MapReduce jobs fill it up quickly, and once the JVM spends around 98% of its time garbage collecting, it fails with a GC overhead limit error.
One solution might be to chunk the large files into smaller pieces. However, I am also considering that a technology other than Pig might not fill up the heap as much. Any ideas on how to approach such a use case? Thanks.
Upvotes: 1
Views: 193
Reputation: 2255
The best approach for this is to use HiveQL's LOAD DATA statement instead of Pig's LOAD. It is just a file transfer into the table's HDFS location; no MapReduce jobs are launched.
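As a minimal sketch (the table name, columns, and paths are made up for illustration), the flow is: create a partitioned Hive table, then load each CSV file into its partition. Because the table lives in the Hive metastore, it is automatically visible through HCatalog, which covers the registration requirement.

    -- Hypothetical partitioned table; adjust columns and delimiter to the actual CSV layout.
    CREATE TABLE IF NOT EXISTS demo_events (
      id      BIGINT,
      payload STRING
    )
    PARTITIONED BY (load_date STRING)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

    -- Moves the file from its HDFS staging path into the partition directory.
    -- This is a metadata plus filesystem operation; no MapReduce job is started,
    -- so the 6 GB file never has to fit through a mapper's heap.
    LOAD DATA INPATH '/staging/events_2014-01-01.csv'
    INTO TABLE demo_events
    PARTITION (load_date = '2014-01-01');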
Upvotes: 1