Reputation: 4378
What is the most (time) efficient way of removing all exact duplicates from an unsorted standard internal table (non-deep structure, arbitrarily large)?
All I can think of is sorting the entire table by all of its fields and then running DELETE ADJACENT DUPLICATES FROM itab COMPARING ALL FIELDS (sketched below). Is there a faster or preferred alternative? Will this cause problems if the structure mixes alphanumeric fields with numeric ones?
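Roughly what I have in mind (lt_data and its fields are placeholders for the real table):

    " sort by every field of the structure so exact duplicates end up
    " adjacent, then drop them in a single pass
    SORT lt_data BY field1 field2 field3.
    DELETE ADJACENT DUPLICATES FROM lt_data COMPARING ALL FIELDS.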
To provide context: I'm trying to improve the performance of some horrible select logic in legacy programs. Most of these run full table scans on 5-10 joined tables, some of them self-joining. I'm left with hundreds of thousands of rows in memory, and I'm fairly sure a large portion of them are just duplicates. However, changing the actual selects is too complex and would require /ex[tp]ensive/ retesting. Just removing the duplicates would likely cut the runtime in half, but I want to make sure that the deduplication itself doesn't add too much overhead.
Upvotes: 2
Views: 2796
Reputation: 18483
I would investigate two methods:
1. Store the original index in an auxiliary field, SORT BY the fields you want to compare (possibly using STABLE), DELETE ADJACENT DUPLICATES, then re-SORT BY the stored index (first sketch below).
2. Use a HASHED TABLE keyed on the fields you want to compare and LOOP through the data table. For each row, use READ TABLE .. TRANSPORTING NO FIELDS on the hashed table to find out whether the value already exists; if so, delete the row from the data table, otherwise add its values to the hashed table (second sketch below).
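A minimal sketch of the first method, assuming ty_row is the (non-deep) line type of your data table lt_data; all names are placeholders:

    TYPES: BEGIN OF ty_indexed,
             idx  TYPE sytabix,  " auxiliary field: original position
             data TYPE ty_row,   " the original row
           END OF ty_indexed.

    DATA: lt_indexed TYPE STANDARD TABLE OF ty_indexed,
          ls_indexed TYPE ty_indexed.

    " remember the original index of every row
    LOOP AT lt_data INTO ls_indexed-data.
      ls_indexed-idx = sy-tabix.
      APPEND ls_indexed TO lt_indexed.
    ENDLOOP.

    " sort by the payload only; STABLE keeps equal rows in input order
    SORT lt_indexed STABLE BY data.
    DELETE ADJACENT DUPLICATES FROM lt_indexed COMPARING data.

    " restore the original order via the stored index
    SORT lt_indexed BY idx.

    CLEAR lt_data.
    LOOP AT lt_indexed INTO ls_indexed.
      APPEND ls_indexed-data TO lt_data.
    ENDLOOP.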
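And a sketch of the second method with the same placeholder names, keying the hashed table on the entire line:

    DATA: lt_seen TYPE HASHED TABLE OF ty_row WITH UNIQUE KEY table_line,
          ls_row  TYPE ty_row.

    LOOP AT lt_data INTO ls_row.
      " key lookup in a hashed table is constant time
      READ TABLE lt_seen FROM ls_row TRANSPORTING NO FIELDS.
      IF sy-subrc = 0.
        DELETE lt_data.                 " duplicate: drop the current row
      ELSE.
        INSERT ls_row INTO TABLE lt_seen.
      ENDIF.
    ENDLOOP.

This variant keeps the original row order without any sorting, at the cost of holding one extra copy of the unique rows in memory.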
I'm not sure about the relative performance, but I would recommend running SAT (the ABAP runtime analysis) on a plausible data set for both methods and comparing the results.
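If setting up SAT is too much effort for a first estimate, a quick micro-benchmark with GET RUN TIME is an alternative:

    DATA: lv_t0 TYPE i,
          lv_t1 TYPE i.

    GET RUN TIME FIELD lv_t0.
    " ... run one deduplication variant here ...
    GET RUN TIME FIELD lv_t1.
    WRITE: / 'Runtime in microseconds:', lv_t1 - lv_t0.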
Upvotes: 4