Yuriy

Reputation: 1984

Amazon Elastic MapReduce - mass insert from S3 to DynamoDB is incredibly slow

I need to perform an initial upload of roughly 130 million items (5+ GB total) into a single DynamoDB table. After running into problems uploading them through the API from my application, I decided to try EMR instead.

Long story short, importing that very average (for EMR) amount of data takes ages even on the most powerful cluster: hundreds of hours with very little progress (about 20 minutes to process a 2 MB test chunk, and no luck finishing a 700 MB test file within 12 hours).

I have already contacted Amazon Premium Support, but so far they have only said that "for some reason DynamoDB import is slow".

I have tried the following instructions in my interactive Hive session:

CREATE EXTERNAL TABLE test_medium (
  hash_key string,
  range_key bigint,
  field_1 string,
  field_2 string,
  field_3 string,
  field_4 bigint,
  field_5 bigint,
  field_6 string,
  field_7 bigint
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LOCATION 's3://my-bucket/s3_import/'
;

CREATE EXTERNAL TABLE ddb_target (
  hash_key string,
  range_key bigint,
  field_1 bigint,
  field_2 bigint,
  field_3 bigint,
  field_4 bigint,
  field_5 bigint,
  field_6 string,
  field_7 bigint
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name" = "my_ddb_table",
  "dynamodb.column.mapping" = "hash_key:hash_key,range_key:range_key,field_1:field_1,field_2:field_2,field_3:field_3,field_4:field_4,field_5:field_5,field_6:field_6,field_7:field_7"
)
;  

INSERT OVERWRITE TABLE ddb_target SELECT * FROM test_medium;

Various flags don't seem to have any visible effect. I have tried the following settings instead of the default ones:

SET dynamodb.throughput.write.percent = 1.0;
SET dynamodb.throughput.read.percent = 1.0;
SET dynamodb.endpoint=dynamodb.eu-west-1.amazonaws.com;
SET hive.base.inputformat=org.apache.hadoop.hive.ql.io.HiveInputFormat;
SET mapred.map.tasks = 100;
SET mapred.reduce.tasks=20;
SET hive.exec.reducers.max = 100;
SET hive.exec.reducers.min = 50;

The same commands run against HDFS instead of the DynamoDB target completed in seconds.
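For reference, the HDFS-target variant was essentially the same DDL pointed at an HDFS location, roughly like this (the table name and location here are illustrative, not the exact ones I used):

CREATE TABLE hdfs_target (
  hash_key string,
  range_key bigint,
  field_1 string,
  field_2 string,
  field_3 string,
  field_4 bigint,
  field_5 bigint,
  field_6 string,
  field_7 bigint
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LOCATION '/tmp/hdfs_target/';

INSERT OVERWRITE TABLE hdfs_target SELECT * FROM test_medium;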

This seems like a simple task and a very basic use case, and I really wonder what I could be doing wrong here.

Upvotes: 17

Views: 7418

Answers (2)

Luciano Marqueto

Reputation: 1198

I faced the same problem last week. I made some notes on what improved the write time to DynamoDB:

  1. Look at the input files: if they are compressed, Hive can't split them into more tasks than the number of files, which caps the possible number of mappers.

  2. Set the number of reducers to 1 or -1; they don't appear to be used much here, and this frees up slots for the mappers.

  3. In DynamoDB, if you are using provisioned capacity, you need to set the number of WCUs that you want to use. Remember that Hive will try not to consume more than the percentage set in dynamodb.throughput.write.percent. If you are using auto-scaling, set write.percent higher than the target utilization percentage to guarantee it will scale. Or switch the table to on-demand and don't worry about this, though it is more expensive.

  4. You can change the memory configuration of the instances to try to get more mappers; the page linked below lists the default configurations. Change mapreduce.map.memory.mb and mapreduce.reduce.memory.mb, but be careful: you can run into out-of-memory errors. A sketch of these settings follows this list. https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop-task-config.html
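Putting points 2-4 together in Hive terms, a minimal sketch could look like this (the numbers are illustrative assumptions, not recommendations; tune them for your cluster, table, and auto-scaling target):

-- Point 2: hand reducer count back to the engine, freeing slots for mappers
SET mapreduce.job.reduces=-1;

-- Point 3: with auto-scaling at, say, a 70% target utilization, keep
-- write.percent above 0.7 so consumption actually triggers scaling
SET dynamodb.throughput.write.percent=1.0;

-- Point 4: smaller containers can fit more mappers per node, but going
-- too low risks out-of-memory errors
SET mapreduce.map.memory.mb=2048;
SET mapreduce.reduce.memory.mb=2048;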

Some related links:

http://cloudsqale.com/2018/10/22/tez-internals-1-number-of-map-tasks/

https://github.com/awslabs/emr-dynamodb-connector

https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/EMRforDynamoDB.PerformanceTuning.Mappers.html

Upvotes: 0

Yuriy

Reputation: 1984

Here is the answer I finally got from AWS support recently. Hope that helps someone in a similar situation:

EMR workers are currently implemented as single-threaded workers, where each worker writes items one by one (using Put, not BatchWrite). Therefore, each write consumes 1 write capacity unit (IOP).

This means that you are establishing a lot of connections, which decreases performance to some degree. If BatchWrites were used, you could commit up to 25 rows in a single operation, which would be less costly performance-wise (but the same price, if I understand it right). This is something we are aware of and will probably implement in the future in EMR. We can't offer a timeline, though.

As stated before, the main problem here is that your table in DynamoDB is reaching its provisioned throughput, so try to increase it temporarily for the import, and then feel free to decrease it to whatever level you need.

This may sound a bit convenient, but there was a problem with the alerts when you were doing this, which is why you never received one. That problem has since been fixed.

Upvotes: 15
