Lilienthal
Lilienthal

Reputation: 4378

Removing non-adjacent duplicates comparing all fields

What is the most (time) efficient way of removing all exact duplicates from an unsorted standard internal table (non-deep structure, arbitrarily large)?

All I can think of is simply sorting the entire thing by all of its fields before running DELETE ADJACENT DUPLICATES FROM itab COMPARING ALL FIELDS. Is there a faster or preferred alternative? Will this cause problems if the structure mixes alphanumeric fields with numerics?

To provide context, I'm trying to improve performance on some horrible select logic in legacy programs. Most of these run full table scans on 5-10 joined tables, some of them self-joining. I'm left with hundreds of thousands of rows in memory and I'm fairly sure a large portion of them are just duplicates. However, changing the actual selects is too complex and would require /ex[tp]ensive/ retesting. Just removing duplicates would likely cut runtime in half but I want to make sure that the deduplication doesn't add too much overhead itself.

Upvotes: 2

Views: 2796

Answers (1)

vwegert
vwegert

Reputation: 18483

I would investigate two methods:

  • Store the original index in an auxiliary field, SORT BY the fields you want to compare (possibly using STABLE), DELETE ADJACENT DUPLICATES, then re-SORT BY the stored index.

  • Using a HASHED TABLE for the fields you want to compare, LOOP through the data table. Use READ TABLE .. TRANSPORTING NO FIELDS on the hashed table to find out whether the value already existed and if so, remove it - otherwise add the values to the hashed table.

I'm not sure about the performance, but I would recommend to use SAT on a plausible data set for both methods and compare the results.

Upvotes: 4

Related Questions