Reputation: 29168
AWS DynamoDB is in high demand as a cloud storage system, and I need the Scan process to run faster. How does the Scan process work, and which one (Scan or Parallel Scan) is better in which situation?
Upvotes: 8
Views: 17730
Reputation: 14859
Addressing the question of when a Parallel Scan should be used over a regular Scan...
My experience is that a parallel scan is faster than a regular scan once you get above 2MB of data in a table, and roughly, you seem to optimise performance by running one segment per 1MB of data in the table.
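As a rough illustration of that rule of thumb, the sketch below turns a table size in bytes into a segment count: one segment per 1MB, rounded up, and never fewer than one. The name suggest_segments is a hypothetical helper, and the heuristic is only my own observation, not an AWS sizing rule.

```shell
# Hypothetical helper: one segment per 1MB of table data (rounded up),
# following the rule of thumb above. Not an official AWS recommendation.
suggest_segments() {
    bytes=$1
    mb=1048576
    segments=$(( (bytes + mb - 1) / mb ))
    # A scan always needs at least one segment.
    if [ "$segments" -lt 1 ]; then
        segments=1
    fi
    echo "$segments"
}
```

The table size can be read with aws dynamodb describe-table --table-name MyTable --query Table.TableSizeBytes --output text. DynamoDB only refreshes that figure periodically, so treat the result as approximate.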
I have three tables, each using on-demand capacity: a tiny table containing 300 items and 70KB of data, a small table containing 1,800 items and 4MB of data, and a large table containing 1.1 million items and 1.05GB of data.
I can time a regular scan by putting this command into a shell script called scan.sh
aws dynamodb scan --table-name MyTable --select COUNT
And then execute
time sh scan.sh
I can time a parallel scan by replacing the command in the shell script with
aws dynamodb scan --table-name MyTable --total-segments 4 --segment 0 --select COUNT
The above command splits the scan into 4 segments but executes only one of them (segment 0). In my application I use DynamoDBMapper (Java SDK), and the SDK takes care of running the segments in parallel on separate threads.
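To run every segment from the shell rather than just segment 0, something like the following sketch could be used. It starts one background scan per segment and sums the counts they print. The name parallel_scan_count is a hypothetical helper, and this is only a CLI approximation, not what DynamoDBMapper does internally.

```shell
# Hypothetical helper: run all segments of a parallel scan concurrently
# and print the total item count.
# Usage: parallel_scan_count TABLE_NAME TOTAL_SEGMENTS
parallel_scan_count() {
    table=$1
    segments=$2
    i=0
    while [ "$i" -lt "$segments" ]; do
        # One background worker per segment. Each prints its item count,
        # one number per page of results.
        aws dynamodb scan --table-name "$table" \
            --total-segments "$segments" --segment "$i" \
            --select COUNT --query Count --output text &
        i=$((i + 1))
    done | awk '{ total += $1 } END { print total + 0 }'
    # awk reads until every worker has closed its output, so the function
    # returns only after all segments have finished.
}
```

Adding a call such as parallel_scan_count MyTable 4 as the last line of scan.sh lets the whole parallel scan be timed the same way as the regular one.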
On my tiny table, each scan took 1.4s, and running parallel scans made no difference. On my small table a regular scan took 1.8s and a parallel scan was optimal with 4 segments, running in 1.4s.
The interesting result was the large table. Here is the time taken to execute the scan, based on the number of segments in the parallel scan:
Upvotes: 6
Reputation: 29168
1. How does Scan work in AWS DynamoDB?
Ans:
i) A Scan operation returns one or more items by reading every item in the table or secondary index.
ii) By default, Scan operations proceed sequentially.
iii) By default, Scan uses eventually consistent reads when accessing the data in a table.
iv) If the total size of the scanned items exceeds the maximum data set size limit of 1 MB, the scan stops and the results are returned to the user along with a LastEvaluatedKey value, which is used to continue the scan in a subsequent operation.
v) A Scan operation performs eventually consistent reads by default, and it can return up to 1 MB (one page) of data. Therefore, a single Scan request can consume
(1 MB page size / 4 KB item size) / 2 (eventually consistent reads) = 128 read operations.
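The arithmetic in point v) can be checked with a small sketch. The name scan_page_read_units is a hypothetical helper. It assumes one read unit per 4 KB, halved for eventually consistent reads, and uses integer division, so it is only exact for page sizes that are whole multiples of 4 KB.

```shell
# Hypothetical helper: read units consumed by one full Scan page.
# Arguments: page size in KB, then "eventual" or "strong".
scan_page_read_units() {
    page_kb=$1
    consistency=$2
    units=$(( page_kb / 4 ))      # one read unit per 4 KB read
    if [ "$consistency" = "eventual" ]; then
        units=$(( units / 2 ))    # eventually consistent reads cost half
    fi
    echo "$units"
}

scan_page_read_units 1024 eventual   # prints 128
scan_page_read_units 1024 strong     # prints 256
```

The second call shows that the same 1 MB page costs twice as much when the scan asks for strongly consistent reads.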
2. How does Parallel Scan work in AWS DynamoDB?
Ans:
i) For faster performance on a large table or secondary index, applications can request a parallel Scan operation.
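ii) You can run multiple worker threads or processes in parallel. Each worker scans a separate segment of the table concurrently with the other workers. To support this, DynamoDB's Scan operation accepts two additional parameters: TotalSegments (the number of segments the table is divided into) and Segment (the zero-based index of the segment a given worker scans).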
iii) The two parameters, when used together, limit the scan to a particular block of items in the table. You can also use the existing Limit parameter to control how much data is returned by an individual Scan request.
3. Scan vs Parallel Scan in AWS DynamoDB?
Ans:
i) A Scan operation can only read one partition at a time, so a parallel scan is needed to read from multiple partitions at once.
ii) A sequential Scan might not always be able to fully utilize the provisioned read throughput capacity, so a parallel scan helps there.
iii) A parallel scan does not by itself reduce cost: it reads the same items as a sequential scan, so it consumes the same total read capacity, only in less time.
4. When is Parallel Scan preferred?
Ans:
A parallel scan can be the right choice if the following conditions are met:
The table size is 20 GB or larger.
The table's provisioned read throughput is not being fully utilized.
Sequential Scan operations are too slow.
5. Is the filter expression applied before the scan?
Ans: No. A FilterExpression is applied after the items have already been read, so filtering does not reduce the read capacity units consumed, nor does it consume any additional ones.
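One way to observe this is to compare Count (items that passed the filter) with ScannedCount (items actually read) in the scan response. The sketch below does that for an equality filter on a single attribute. The name scan_filter_stats is a hypothetical helper, and it assumes the attribute holds a string value.

```shell
# Hypothetical helper: scan with an equality filter on one string attribute
# and report how many items matched versus how many were read.
# Usage: scan_filter_stats TABLE_NAME ATTRIBUTE_NAME STRING_VALUE
scan_filter_stats() {
    table=$1
    attr=$2
    value=$3
    # The CLI prints one "Count<TAB>ScannedCount" line per page of results.
    aws dynamodb scan --table-name "$table" --select COUNT \
        --filter-expression "#a = :v" \
        --expression-attribute-names "{\"#a\": \"$attr\"}" \
        --expression-attribute-values "{\":v\": {\"S\": \"$value\"}}" \
        --query '[Count, ScannedCount]' --output text |
        awk '{ matched += $1; scanned += $2 }
             END { printf "matched=%d scanned=%d\n", matched, scanned }'
}
```

If scanned is much larger than matched, the scan is paying to read items that the filter then discards.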
Upvotes: 20