King96

Reputation: 47

What is the fastest method of sorting out duplicate strings

I am developing an app that focuses on manipulating strings in various ways. One of these operations removes duplicate items while combining files. I have attempted to use this:

Private foundList As New List(Of String) ' form-level list that accumulates the words

Private Sub run()
    For Each filePath As String In ListBox1.Items
        For Each Line In IO.File.ReadAllLines(filePath)
            Dim founds() As String = Line.Split(":"c)
            Dim hash As String = founds(0)
            Dim word As String = founds(1)
            foundList.Add(word)
            Dim result As List(Of String) = foundList.Distinct().ToList
            Label1.Text = result.Count
            For Each addstring In result
                ListBox2.Items.Add(addstring)
            Next
        Next
    Next
End Sub

Distinct was very slow in this fashion, so I tried using:

Private Sub run()
    For Each filePath As String In ListBox1.Items
        For Each Line In IO.File.ReadAllLines(filePath)
            Dim founds() As String = Line.Split(":"c)
            Dim hash As String = founds(0)
            Dim word As String = founds(1)

            If Not foundList.Contains(word) Then
                foundList.Add(word)
                Label1.Text = foundList.Count
            End If
        Next
    Next
    For Each found In foundList
        ListBox2.Items.Add(found)
    Next
End Sub

This was much faster, but it still performs slower than should be possible without resorting to OpenCL or similar. I can write in C# if something different is available there, but this is .NET. Can anyone suggest a faster or more effective method? This can't possibly be it; surely I am missing something.

Upvotes: 2

Views: 249

Answers (3)

Fabio

Reputation: 32443

Use HashSet(Of String)

Dim lines = IO.File.ReadAllLines(filePath)
Dim uniqueLines = New HashSet(Of String)(lines)

After initialization the HashSet will contain only unique values.
You can also use the HashSet(Of T).Add(value) method, which returns True if the value was added to the set and False if the value already exists in it:

Dim isAdded As Boolean = uniqueLines.Add("someValue")
If isAdded Then
    ' Do something if added
Else
    ' Do something if already exists
End If

HashSet has a Contains method that runs in O(1): it takes a fixed number of operations regardless of size, whereas List.Contains iterates the whole list until it finds the given value (O(N): in the worst case the number of operations equals the number of items).
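
A quick way to see the difference is to time the same lookup against both collections. A minimal standalone sketch (the item count, sample data, and names are arbitrary; absolute timings will vary by machine):

Imports System.Collections.Generic
Imports System.Diagnostics

Module ContainsBenchmark
    Sub Main()
        ' Build a list and a set holding the same unique strings.
        Dim n As Integer = 100000
        Dim itemList As New List(Of String)
        Dim itemSet As New HashSet(Of String)
        For i As Integer = 0 To n - 1
            Dim s As String = "item" & i.ToString()
            itemList.Add(s)
            itemSet.Add(s)
        Next

        ' Look up a value near the end many times; for List.Contains
        ' this is close to the worst case (a full scan each time).
        Dim target As String = "item" & (n - 1).ToString()
        Dim hits As Integer = 0

        Dim sw As Stopwatch = Stopwatch.StartNew()
        For i As Integer = 1 To 1000
            If itemList.Contains(target) Then hits += 1
        Next
        Console.WriteLine("List.Contains:    " & sw.ElapsedMilliseconds & " ms")

        sw.Restart()
        For i As Integer = 1 To 1000
            If itemSet.Contains(target) Then hits += 1
        Next
        Console.WriteLine("HashSet.Contains: " & sw.ElapsedMilliseconds & " ms")
    End Sub
End Module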

So your function can be rewritten as below:

Private Sub run()
    ' get data
    Dim allItems = ListBox1.Items.
                            Cast(Of String)().
                            SelectMany(Function(path) IO.File.ReadAllLines(path)).
                            Select(Function(line) line.Split(":"c)(1))
    Dim uniqueItems = New HashSet(Of String)(allItems)

    ' update controls
    Label1.Text = uniqueItems.Count.ToString()
    ListBox2.Items.AddRange(uniqueItems.ToArray())
End Sub

Notice that the items are added to ListBox2 using the .AddRange method. This adds them all in one operation and redraws the control only once, whereas adding items one by one (using .Add) makes the control redraw itself for every added item, which can be heavy for a large number of items.
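
If you do need to add items one at a time for some reason, the ListBox control also has BeginUpdate/EndUpdate to suspend painting until all the items are in. A short sketch of that alternative, reusing uniqueItems from the function above:

' Suspend painting while items are added individually,
' then let the control redraw once at the end.
ListBox2.BeginUpdate()
Try
    For Each item As String In uniqueItems
        ListBox2.Items.Add(item)
    Next
Finally
    ListBox2.EndUpdate()
End Try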

Upvotes: 1

Corey

Reputation: 16584

Fabio got to the obvious answer before I finished this. I've put some more detail in, but skip to the bottom for another idea.


The obvious speed issue is in the string comparison:

If Not foundList.Contains(word) Then

Comparing strings is a fairly expensive operation, and here you're comparing strings against a successively larger list of other strings. For a short list that might be OK, but when you're dealing with big lists it's going to slow things down considerably.

The better option is to hash each string once then compare the hashes. There's some finesse required to handle hash collisions, but when the bulk of the data is unique and the hash function is good then you'll get a huge improvement in the speed.
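
To illustrate that finesse, here is a minimal hand-rolled sketch of hash-bucket lookup (roughly what HashSet does for you internally; the module and names are illustrative only):

Imports System.Collections.Generic

Module HashBucketSketch
    ' Buckets keyed by hash code; each bucket holds the strings sharing that code.
    Private ReadOnly buckets As New Dictionary(Of Integer, List(Of String))

    ' Returns True if the word was new, False if it was already present.
    Function TryAddUnique(word As String) As Boolean
        Dim h As Integer = word.GetHashCode()
        Dim bucket As List(Of String) = Nothing
        If buckets.TryGetValue(h, bucket) Then
            ' Equal hashes do not guarantee equal strings, so on a bucket
            ' hit the stored values still have to be compared directly.
            If bucket.Contains(word) Then Return False
            bucket.Add(word)
        Else
            buckets(h) = New List(Of String) From {word}
        End If
        Return True
    End Function
End Module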

In .NET the HashSet(Of T) class implements hash-based storage and lookup.

Being a Set, HashSet will only hold one of any particular value, which handles the duplication issue. It makes no guarantees about retaining the order of its contents, but in practice if you are adding only and never removing items then order is preserved.

Lookups in HashSet are very fast due to the way the hash values are stored and indexed internally. Testing to see if a value exists in the set is almost unaffected by the number of items in the list. I get lookup times on the order of ~50ns for lookups in lists from 1,000 to 1,000,000 strings with a simple test (100,000,000 lookups).

For your purposes the usage would be something like:

Private Sub Run()
    ' shortcut the file reads...
    Dim items = ListBox1.Items.OfType(Of String)().
        SelectMany(Function(fn) IO.File.ReadAllLines(fn)).
        Select(Function(i) i.Split(":"c)(1))
    Dim hash = New HashSet(Of String)(items)
    ListBox2.Items.Clear()
    ListBox2.Items.AddRange(hash.ToArray())
End Sub

(Sorry, VB is not my native language.)


Question is, will this actually speed things up? You'll need to do some testing of your own, but I suspect that the answer might be: not much.

Yes, it's faster than using Distinct and ToArray to get an array of distinct values. Almost twice as fast by my simple test: ~180ms vs ~275ms for a million distinct 36-character strings (yes, they're Guids) in an array. Not much of an improvement. YMMV, but if the operation is taking significantly more time than that, then the Distinct is probably not your biggest problem.
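
For reference, a rough version of that comparison (the count mirrors the figures above; exact timings will vary by machine):

Imports System.Collections.Generic
Imports System.Diagnostics
Imports System.Linq

Module DedupBenchmark
    Sub Main()
        ' A million distinct 36-character strings (Guids), as above.
        Dim data As String() = Enumerable.Range(0, 1000000).
            Select(Function(i) Guid.NewGuid().ToString()).
            ToArray()

        Dim sw As Stopwatch = Stopwatch.StartNew()
        Dim viaDistinct As String() = data.Distinct().ToArray()
        Console.WriteLine("Distinct().ToArray(): " & sw.ElapsedMilliseconds & " ms")

        sw.Restart()
        Dim viaHashSet As String() = New HashSet(Of String)(data).ToArray()
        Console.WriteLine("HashSet + ToArray():  " & sw.ElapsedMilliseconds & " ms")
    End Sub
End Module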

Do some profiling and find the actual pain point. I suspect that you'll find that ListBox2 has the Sorted flag set. Try this:

Private Sub Run()
    Dim items = ListBox1.Items.OfType(Of String)().
        SelectMany(Function(fn) IO.File.ReadAllLines(fn)).
        Select(Function(i) i.Split(":"c)(1))
    Dim hash = New HashSet(Of String)(items)
    ListBox2.Items.Clear()
    ' Turn sort-on-insert off while adding, then restore it (one sort at the end).
    Dim sorted = ListBox2.Sorted
    ListBox2.Sorted = False
    ListBox2.Items.AddRange(hash.ToArray())
    ListBox2.Sorted = sorted
End Sub

If that's a lot faster, then the problem isn't in the Distinct, it's in the sort-on-insert, which is painfully slow and almost always the worst option for sorted lists.

If it's not, then the problem lies elsewhere, and only profiling will tell you where.

Upvotes: 1

Slai

Reputation: 22886

Get all values before using Distinct and adding them (the slowest part is still updating the control):

ListBox2.DataSource = (From line In IO.File.ReadLines(filePath) 
                       Select line.Split(":"c)(1) Distinct).ToList
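
If you need all of the files listed in ListBox1 rather than a single path, the same idea extends with a second From clause (a sketch, assuming the list items are file path strings):

ListBox2.DataSource = (From path In ListBox1.Items.Cast(Of String)()
                       From line In IO.File.ReadLines(path)
                       Select line.Split(":"c)(1) Distinct).ToList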

Upvotes: 1
