Jens Mühlenhoff
Jens Mühlenhoff

Reputation: 14873

How to remove all duplicates from a list?

Consider this test app:

function RemoveDuplicates(const Input: IEnumerable<Integer>): IEnumerable<Integer>;
begin
  // How to implement this function?
end;

var
  Enumerable: IEnumerable<Integer>;
  UniqueEnumerable: IEnumerable<Integer>;
begin
  Enumerable := TCollections.CreateList<Integer>([1, 1, 2, 3, 3, 3, 4]);
  UniqueEnumerable := RemoveDuplicates(Enumerable);
  UniqueEnumerable.ForEach(
    procedure(const I: Integer)
    begin
      WriteLn(I);
    end);
  ReadLn;
end.

How can I implement the RemoveDuplicates function (this is called nub in Haskell)?

Upvotes: 6

Views: 1675

Answers (4)

Jens M&#252;hlenhoff
Jens M&#252;hlenhoff

Reputation: 14873

Using an intermediate list:

function RemoveDuplicates(const Input: IEnumerable<Integer>): IEnumerable<Integer>;
var
  List: IList<Integer>;
begin
  List := TCollections.CreateList<Integer>;
  Input.ForEach(
    procedure(const I: Integer)
    begin
      if not List.Contains(I) then
        List.Add(I);
    end);
  Result := List;
end;

This is obviously not the best performing solution, see the other answers for better alternatives.

Upvotes: 0

Johan
Johan

Reputation: 76557

Jens's solution will work, but it has a rather slow running time of O(n2).

A better alternative if you have a long list is to
- Sort the list
- Compare every item with its successor.

This has a running time of O(n log n) for the quicksort + O(n) for the search for a total running time of O(n log n).

See the following pseudo code (don't have access to Delphi now).

function RemoveDuplicates(const Input: IEnumerable<Integer>): IEnumerable<Integer>;
var
  List: IList<Integer>;
  i: integer;
begin
  List := TCollections.CreateList<Integer>;
  List.Assign(Input); //Copy input list to output.
  List.Sort;
  for i:= List.Count-1 downto 1 do begin
    if List[i] = List[i-1] then List.delete(i); 
    //if Comparer<T>.Equals(List[i], List[i-1]) then ....
  end; {for i}
end;

Problems
The problem with this approach is that the output (might) have a different order from the input. This may or may not be a problem.

Benefits (or why the dictionary sucks)
If the sorting is a cheap operation this will be the fastest approach.
The use of a dictionary carries high constant cost for the hashing.
Even though the hashing operation is O(1), it can get very expensive for large keys, because the hash will always process the whole key, whereas sorting comparison will stop as soon a a difference is detected. Further note that hashing is a much more expensive operation than a simple comparision (about 30x to 100x slower)!

Only when the list is huge does the better asymptotic running time of the dictonary kick in.

Upvotes: 4

Stefan Glienke
Stefan Glienke

Reputation: 21713

Use what is already there:

uses
  Spring.Collections,
  Spring.collections.Extensions;

function RemoveDuplicates(const Input: IEnumerable<Integer>): IEnumerable<Integer>;
begin
  Result := TDistinctIterator<Integer>.Create(Input, nil);
end;

This supports lazy evaluation (means that Input is not getting processed before the resulting enumerable is getting processed). It is using a hashset (currently implemented as Dictionary) internally to keep track of the already found items (this happens inside the enumerator).

Why is that important? Because any operation that does a complete enumeration might cause unwanted performance impact if Input involves other costly operations which may by far outweigh any benefits of other approaches of removing duplicates (like putting it into a list and sort that). Also an IEnumerable is not guaranteed to be finite.

If between calling this function and enumerating the result the Input was changed that change is affecting the outcome of your enumeration whereas it would not be the case if you are not supporting lazy evaluation. If you are enumerating multiple times the outcome might be different (i.e. up-to-date) each time.

Upvotes: 12

Wosi
Wosi

Reputation: 45223

For performance reasons I suggest to use a sorted list dictionary.

function RemoveDuplicates(const Input: IEnumerable<Integer>): IEnumerable<Integer>;
var
  Dictionary: IDictionary<integer, integer>;
  Item: integer;
begin
  Dictionary := TCollections.CreateDictionary<integer,integer>;
  for Item in Input do
    Dictionary.AddOrSetValue(Item, 0);     

  Result := Dictionary.Keys;
end;

Upvotes: 3

Related Questions