Reputation: 33252
I'm learning Hadoop and experimenting on a project that could go into production as a big data project. For now I'm just running tests with a small amount of data. The scenario is as follows: there is a bunch of JSON files that I load into Pig like this:
a = LOAD 's3n://mybucket/user_*.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);
b = FOREACH a GENERATE FLATTEN(json#'user') AS (m:map[]);
Let's say the files are small, each containing just one object, but there are a bunch of them. I'm assuming the FOREACH would run in parallel, opening several files at once, am I wrong? The program takes a while to run, about 10 seconds on an Amazon c3.xlarge instance, and there are about 400 files. I'm sure a C# program would do this in a fraction of a second. Where am I wrong?
Upvotes: 0
Views: 349
Reputation: 1115
Pig does run tasks in parallel, but it spends a fixed amount of time up front because it compiles to MapReduce and optimizes the whole script, so operating on a small data set will be slower in Pig. It is meant for big data sets. To increase the number of parallel tasks for small data, you can use the PARALLEL clause on operators that have a reduce phase (GROUP, JOIN, ORDER BY, etc.; FOREACH itself runs map-side, so PARALLEL has no effect there), or you can raise the reducer count for the whole script with SET default_parallel n, which sets the parallelism to n. The last possibility is that Pig is running everything as mappers, and the number of mappers is too small because your files are tiny; in that case you have to change some YARN/Pig configuration to increase the number of mappers, as in the sketch below.
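A minimal sketch of those knobs, reusing the load from the question. The GROUP key and the parallelism numbers are illustrative, and the split-combination property name is an assumption you should verify against your Pig version:

-- raise the reducer count for every reduce phase in the script
SET default_parallel 10;
-- Pig combines small input files into one split by default; disabling that
-- combination yields one mapper per file (property name assumed, check your
-- Pig version's docs)
SET pig.splitCombination false;

a = LOAD 's3n://mybucket/user_*.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);
b = FOREACH a GENERATE FLATTEN(json#'user') AS (m:map[]);

-- PARALLEL only takes effect on reduce-phase operators such as GROUP;
-- the 'id' key is a hypothetical field of the user map
grp = GROUP b BY m#'id' PARALLEL 10;

Note that even with these settings, per-job startup overhead of a few seconds is normal for MapReduce, so a 400-file, one-object-per-file workload will never approach the latency of a single-process C# program.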
Upvotes: 1