SkyWalker

Reputation: 29168

Scan vs Parallel Scan in AWS DynamoDB?

AWS is in high demand for cloud storage, and scans need to be fast. How does the scan process work, and which one (Scan or Parallel Scan) is better in which situation?

  1. How does Scan work in AWS DynamoDB?
  2. How does Parallel Scan work in AWS DynamoDB?
  3. Scan vs Parallel Scan in AWS DynamoDB?
  4. When is a Parallel Scan preferred?
  5. Is a filter expression applied before the scan?

Upvotes: 8

Views: 17730

Answers (2)

F_SO_K

Reputation: 14859

Addressing the question of when a Parallel Scan should be used over a regular Scan...

My experience is that a parallel scan is faster than a regular scan once you get above 2MB of data in a table, and roughly, you seem to optimise performance by running one segment per 1MB of data in the table.

I have three tables, each using on-demand capacity: a tiny table containing 300 items and 70KB of data, a small table containing 1,800 items and 4MB of data, and a large table containing 1.1 million items and 1.05GB of data.

I can time a regular scan by putting this command into a shell script called scan.sh

aws dynamodb scan --table-name MyTable --select COUNT

And then execute

time ./scan.sh

I can time a parallel scan by replacing the command in the shell script with

aws dynamodb scan --table-name MyTable --total-segments 4 --segment 0 --select COUNT

The above command divides the scan into 4 segments and executes only one of them (segment 0). I use DynamoDBMapper (Java SDK) in my application, and the SDK takes care of running the segments on parallel threads.
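As a rough illustration of what "running the segments in parallel" involves, here is a minimal Python sketch. It is an assumption-laden example, not the DynamoDBMapper code from my application: it assumes a boto3-style DynamoDB client is passed in, and the helper names `count_segment` and `parallel_count` are made up for this example. Each worker pages through its own segment and the per-segment counts are summed.

```python
from concurrent.futures import ThreadPoolExecutor


def count_segment(client, table_name, segment, total_segments):
    """Count the items in one segment, following LastEvaluatedKey across pages."""
    kwargs = {
        "TableName": table_name,
        "Select": "COUNT",
        "Segment": segment,               # zero-based index of this worker's segment
        "TotalSegments": total_segments,  # must be the same for every worker
    }
    count = 0
    while True:
        page = client.scan(**kwargs)
        count += page["Count"]
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            return count
        # Continue this segment from where the previous page stopped.
        kwargs["ExclusiveStartKey"] = last_key


def parallel_count(client, table_name, total_segments=4):
    """Scan every segment on its own thread and sum the per-segment counts."""
    with ThreadPoolExecutor(max_workers=total_segments) as pool:
        futures = [
            pool.submit(count_segment, client, table_name, seg, total_segments)
            for seg in range(total_segments)
        ]
        return sum(f.result() for f in futures)
```

With boto3 installed and credentials configured, `parallel_count(boto3.client("dynamodb"), "MyTable", total_segments=4)` should give the same total as adding up the CLI results for segments 0 to 3.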

On my tiny table, each scan took 1.4s, and running parallel scans made no difference. On my small table a regular scan took 1.8s and a parallel scan was optimal with 4 segments, running in 1.4s.

The interesting result was the large table. Here is the time to execute the scan, based on the number of segments in a parallel scan:

  • 1 segment - 120 seconds
  • 4 segments - 30 seconds
  • 8 segments - 15 seconds
  • 16 segments - 8 seconds
  • 32 segments - 5 seconds
  • 64 segments - 3 seconds
  • 128 segments - 1.9 seconds
  • 256 segments - 1.6 seconds
  • 512 segments - 1.4 seconds
  • 1024 segments - 1.4 seconds
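To make the diminishing returns easier to see, the timings listed above can be turned into speedup and efficiency figures. This is only arithmetic on the numbers in this answer; the function name `speedup_table` is made up for the example, and "efficiency" here simply means speedup divided by the number of segments.

```python
# Timings copied from the list above: number of segments -> seconds (large table).
timings = {1: 120, 4: 30, 8: 15, 16: 8, 32: 5, 64: 3,
           128: 1.9, 256: 1.6, 512: 1.4, 1024: 1.4}


def speedup_table(timings):
    """Return (segments, speedup, efficiency) rows relative to the 1-segment run."""
    baseline = timings[1]
    rows = []
    for segments, seconds in sorted(timings.items()):
        speedup = baseline / seconds
        rows.append((segments, round(speedup, 1), round(speedup / segments, 2)))
    return rows


for segments, speedup, efficiency in speedup_table(timings):
    print(f"{segments:>5} segments: {speedup:>5.1f}x faster, efficiency {efficiency:.2f}")
```

The speedup is close to linear up to 8 segments and then flattens out. The 1.4 second floor is the same time I measured on the tiny table, which suggests (but does not prove) a fixed overhead per invocation rather than anything to do with the table itself.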

Upvotes: 6

SkyWalker

Reputation: 29168

1. How does Scan work in AWS DynamoDB?

Ans:

i) A Scan operation reads every item in a table or secondary index and returns one or more items and their attributes.

ii) By default, Scan operations proceed sequentially.

iii) By default, Scan uses eventually consistent reads when accessing the data in a table.

iv) If the total size of the scanned items exceeds the maximum data set size limit of 1 MB, the scan stops and the results are returned to the user along with a LastEvaluatedKey value, which is used to continue the scan in a subsequent operation.

v) A Scan operation performs eventually consistent reads by default, and it can return up to 1 MB (one page) of data. Therefore, a single Scan request can consume up to:

(1 MB page size / 4 KB item size) / 2 (eventually consistent reads) = 128 read operations.
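The paging behaviour and the capacity arithmetic above can be sketched in Python. This is a minimal sketch under stated assumptions: it assumes a boto3-style client whose `scan` method takes `TableName` and `ExclusiveStartKey`, and the function names `scan_all_pages` and `read_units_per_page` are hypothetical. The capacity helper ignores rounding and only reproduces the calculation shown above.

```python
def scan_all_pages(client, table_name):
    """Yield every item, following LastEvaluatedKey from one page to the next."""
    kwargs = {"TableName": table_name}
    while True:
        page = client.scan(**kwargs)
        yield from page.get("Items", [])
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            break  # no LastEvaluatedKey means the scan reached the end
        kwargs["ExclusiveStartKey"] = last_key


def read_units_per_page(page_bytes=1024 * 1024, consistent=False):
    """Read units for one page: one per 4 KB, halved for eventually consistent reads."""
    units = page_bytes / (4 * 1024)
    return units if consistent else units / 2


print(read_units_per_page())                 # 128.0 for a full 1 MB page
print(read_units_per_page(consistent=True))  # 256.0 with strongly consistent reads
```

Writing the loop as a generator lets the caller stop early without fetching pages it does not need.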

2. How does Parallel Scan work in AWS DynamoDB?

Ans:

i) For faster performance on a large table or secondary index, applications can request a parallel Scan operation.

ii) You can run multiple worker threads or processes in parallel. Each worker scans a separate segment of the table concurrently with the other workers. DynamoDB's Scan function accepts two additional parameters for this:

  • TotalSegments denotes the number of workers that will access the table concurrently.
  • Segment denotes the segment of the table to be accessed by the calling worker (numbered from 0 to TotalSegments - 1).

iii) The two parameters, when used together, limit the scan to a particular block of items in the table. You can also use the existing Limit parameter to control how much data is returned by an individual Scan request.

3. Scan vs Parallel Scan in AWS DynamoDB?

Ans:

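i) A sequential Scan can only read one partition at a time, so a parallel scan is needed to read from multiple partitions at once.

ii) A sequential Scan might not always be able to fully utilize the provisioned read throughput capacity. A parallel scan can make use of that spare capacity.

iii) A parallel scan does not by itself reduce cost: it reads the same data and consumes the same read capacity, only in less elapsed time. The "up to 4x" cost reduction mentioned in the AWS announcement listed under Resource Link refers to a separate change announced at the same time, the increase of the read capacity unit size from 1 KB to 4 KB.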

4. When is a Parallel Scan preferred?

Ans:

A parallel scan can be the right choice if the following conditions are met:

  • The table size is 20 GB or larger.

  • The table's provisioned read throughput is not being fully utilized.

  • Sequential Scan operations are too slow.

5. Is a filter expression applied before the scan?

Ans: No. A FilterExpression is applied after the items have already been read, so a Scan consumes the same amount of read capacity whether or not a filter is present. The filtering itself does not consume any additional read capacity units.
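One way to observe this is to compare ScannedCount (items read) with Count (items returned after filtering) in the Scan response. The sketch below assumes a boto3-style client; the function name `filter_stats` and the attribute `price` in the usage note are hypothetical and only there for illustration.

```python
def filter_stats(client, table_name, filter_expression, expression_values):
    """Run one filtered Scan page and report items read versus items returned."""
    page = client.scan(
        TableName=table_name,
        FilterExpression=filter_expression,
        ExpressionAttributeValues=expression_values,
        ReturnConsumedCapacity="TOTAL",
    )
    return {
        "scanned": page["ScannedCount"],  # items read before the filter was applied
        "returned": page["Count"],        # items remaining after the filter
        "capacity_units": page["ConsumedCapacity"]["CapacityUnits"],
    }
```

For example, `filter_stats(client, "MyTable", "price > :p", {":p": {"N": "100"}})` would typically show "scanned" larger than "returned", while "capacity_units" reflects everything that was scanned rather than only what was returned.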

Resource Link:

  1. Scan

  2. Parallel Scan

  3. Example - Parallel Scan Using Java

  4. Amazon DynamoDB – Parallel Scans, 4x Cheaper Reads, Other Good News

  5. Avoid Sudden Bursts of Read Activity

Upvotes: 20
