FtoTheZ

Reputation: 426

Reading many small files from S3 very slow

Loading many small files (more than 200,000 files of about 4 KB each) from an S3 bucket into HDFS via Hive or Pig on AWS EMR is extremely slow. It seems that only one mapper is used to fetch the data, though I cannot figure out exactly where the bottleneck is.

Pig Code Sample

data = LOAD 's3://data-bucket/' USING PigStorage(',') AS (line:chararray);

Hive Code Sample

CREATE EXTERNAL TABLE data (value STRING) LOCATION  's3://data-bucket/';

Are there any known settings that speed up the process or increase the number of mappers used to fetch the data?

I tried the following without any noticeable effect:

I know that s3distcp would speed up the process, but I could only get better performance after a lot of tweaking, including setting #workerThreads, and I would prefer to change parameters directly in my Pig/Hive scripts.

Upvotes: 2

Views: 2642

Answers (1)

glefait

Reputation: 1691

You can either:

  1. use distcp to merge the files before your job starts: http://snowplowanalytics.com/blog/2013/05/30/dealing-with-hadoops-small-files-problem/

  2. have a Pig script that will do it for you, once (see the sketch after this list).
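
A minimal sketch of option 2, assuming placeholder paths (s3://data-bucket-merged/ is hypothetical): read the small files once with split combination enabled, then store them back as a handful of large files that later jobs load instead.

-- combine small input files into ~250 MB splits so only a few mappers are used
SET pig.noSplitCombination false;
SET pig.maxCombinedSplitSize 250000000;

-- load the small files once...
data = LOAD 's3://data-bucket/' USING PigStorage(',') AS (line:chararray);

-- ...and write them back; a map-only store produces one output file per mapper
STORE data INTO 's3://data-bucket-merged/' USING PigStorage(',');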

If you want to do it through Pig, you need to know how many mappers are spawned. You can play with the following parameters:

-- true: one mapper per input file; false: let Pig combine small files into larger splits
SET pig.noSplitCombination false;
-- combined split size in bytes: choose it so SUM(input sizes) / size = wanted number of mappers
SET pig.maxCombinedSplitSize 250000000;
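
As a worked example with the question's input (roughly 200,000 files of 4 KB, about 800 MB in total), pig.maxCombinedSplitSize = 250000000 (~250 MB) would combine the input into about 800 MB / 250 MB ≈ 3-4 splits, and hence 3-4 mappers.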

Please provide metrics for those cases.

Upvotes: 2
