Reputation: 41
I'm trying to run a simple query on a table with one partition that holds around 200-300k records. Each record is a small file of about 120 bytes.
I'm using a custom INPUTFORMAT that reads each file's contents and then queries another S3 file to fetch the actual data. Each file corresponds to one record.
The query is taking around 6 hours to complete. I am using a cluster of 10 machines of type m2.4xlarge on EMR.
Looking into the logs, there is a one-hour delay between starting the job and starting the MapReduce tasks. Also, the number of mappers is shown as only 1:
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
Is there anything I'm missing? It looks like there is no parallel execution at all. I tried setting the following properties, but saw no improvement:
mapreduce.job.counters.limit 1000
mapred.tasktracker.tasks.maximum 1000
mapred.tasktracker.map.tasks.maximum 100
mapred.tasktracker.reduce.tasks.maximum 95
mapred.map.tasks 100
mapred.child.java.opts -Xmx15048m
namenode-heap-size 15048
Below are the table definition and the query.
CREATE EXTERNAL TABLE IF NOT EXISTS sample(
x string,
y date
)
PARTITIONED BY (date STRING)
ROW FORMAT SERDE "com.gts.hive.analytics.store.serde.CustomSerDe"
STORED AS INPUTFORMAT 'com.gts.hive.analytics.store.formats.mapred.GZipJsonFileInputFormat2'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3n://xyzlocation/';
ALTER TABLE sample ADD IF NOT EXISTS PARTITION(date='2013-12-31-07');
select x from sample;
Upvotes: 1
Views: 3320
Reputation: 83
You can try simply adding cluster by rand() at the end of your query, for example: select x from sample cluster by rand(); I don't quite understand why, but it seems that join, group by, cluster by, etc. enable some sort of shuffle for map-only jobs. Otherwise the number of reducers is auto-magically forced to 1.
Upvotes: 0
Reputation: 3047
gzip is not splittable, so Hadoop cannot divide a gzipped file among several mappers; each gzipped file is read by a single mapper. There are splittable compression formats, such as bzip2, that you can use instead. More info is available at http://comphadoop.weebly.com/
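To illustrate why a gzip file cannot be split, here is a minimal Python sketch (plain standard library, not Hadoop code). It compresses some made-up records and then tries to start decoding from the middle of the compressed bytes, which is roughly what a second mapper would have to do. The payload and the helper name `can_decode_from` are hypothetical and exist only for this demonstration.

```python
import gzip
import zlib

# Hypothetical payload standing in for the contents of one gzipped file.
payload = b"".join(
    b'{"x": "row%d", "y": "2013-12-31"}\n' % i for i in range(1000)
)
compressed = gzip.compress(payload)


def can_decode_from(data: bytes, offset: int) -> bool:
    """Return True if a gzip decoder can start reading at `offset`."""
    try:
        # wbits=31 makes zlib expect a gzip header at the start of the input.
        zlib.decompressobj(31).decompress(data[offset:])
        return True
    except zlib.error:
        return False


# Decoding works from the first byte of the file...
start_ok = can_decode_from(compressed, 0)

# ...but not from an arbitrary position inside it, because the decoder
# needs the header and all of the preceding stream state.
middle_ok = can_decode_from(compressed, len(compressed) // 2)

print(start_ok, middle_ok)
```

Because the decoder can only begin at the start of the stream, a reader assigned the second half of the file has nothing it can decompress on its own. That is why one gzipped file ends up with one mapper.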
Upvotes: 1