Shahram Banazadeh
Shahram Banazadeh

Reputation: 510

Removing duplicate lines from TStringList without sorting in Delphi

I know how to remove duplicate strings from a TStringList using dupignore for a sorted Tstringlist.

CallData := TStringList.Create;
CallData.Sorted := True;
Call.Duplicates := dupIgnore;

But in my case strings must not be sorted .

Using a FOR loop finding duplicates is very slow (also using indexOF())when TStringList has hundreds of thousands of lines .

 if OpenDialog1.Execute then
  begin
    Try
      y := TStringList.create;
      f := TStreamReader.create(OpenDialog1.FileName, TEncoding.UTF8, True);
      while not f.EndOfStream do
      begin
        l := f.ReadLine;
        X.Add(l);
      end;

      g := Tstreamwriter.create('d:\logX.txt', True, TEncoding.UTF8);
      for I := 0 to X.count - 1 do
      begin


          if y.IndexOf(X[I]) = -1 then

          y.Add(X[I]);

      end;

      for j := 0 to y.count - 1 do
        g.WriteLine(y[j]);

    Finally
      f.free;
      y.free;
      g.free;
    End;
  end;

is there any better way ?

Upvotes: 7

Views: 3182

Answers (3)

David Heffernan
David Heffernan

Reputation: 612924

Here's how I would approach this problem:

  1. Create a dictionary keyed on a string. It doesn't matter that the value type is.
  2. Iterate through the string list.
  3. For each string, check whether or not it is in the dictionary.
  4. If it is in the dictionary, remove from the string list. Otherwise add to the dictionary.

If there are a large number of duplicates to be removed then the performance of the above will be affected by repeated removal from the string list. That's because each item to be removed results in the later items being shifted down one index. You can avoid this by copying into a new list rather than deleting inplace.

Alternatively, you can operate in place like this:

  1. Create a dictionary keyed on a string. It doesn't matter that the value type is.
  2. Initialise a variable named Count to zero.
  3. Iterate through the string list in forward order.
  4. For each string, check whether or not it is in the dictionary.
  5. If it is in the dictionary, do nothing. Otherwise add to the dictionary, copy into index Count of the list, and then increment Count.
  6. Once the iteration is complete, resize the list to have Count elements.

The point of the dictionary is that lookup is an O(1) operation and so the second algorithm has O(n) time complexity.

Upvotes: 6

Magoo
Magoo

Reputation: 80013

function compareobjects
          (list     : Tstringlist;
           index1   : integer;
           index2   : integer
          )         : integer;
begin
  if index1 = index2 then
    result := 0
  else
    if integer(list.objects[index1]) < integer(list.objects[index2]) then
      result := -1
    else
      result := 1;
end;

begin
  Try
    y := TStringList.create;
    y.Sorted := true;
    y.Duplicates := dupignore;
    f := TStreamReader.create('c:\106x\q47780823.bat');
    i := 0;
    while not f.EndOfStream do
    begin
      inc(i);
      line := f.readline;
      y.Addobject(line,tobject(i));
    end;
    y.Sorted := false;
    y.CustomSort(compareobjects);

    for i := 0 to y.count - 1 do
      WriteLn(y[i]);

    Finally
      f.free;
      y.free;
  End;
  readln;
end.

I'd keep track of the line number (i) and assign it with the string by casting as an object; sort the list and remove duplicates as before, but then un-sort it using a custom sort on the objects.

Upvotes: 1

Dsm
Dsm

Reputation: 6013

I would use trickery, by having a sorted and an unsorted list. Like this:

  y := TStringList.create;
  s := TStringList.create;
  s.Sorted := TRUE;
  s.Duplicates := dupIgnore;

  f := TStreamReader.create(OpenDialog1.FileName, TEncoding.UTF8, True);
  while not f.EndOfStream do
  begin
    l := f.ReadLine;
    s.Add(l);
    if s.Count > y.Count then y.Add(l);
  end;

  // etc.

Upvotes: 3

Related Questions