Reputation: 47
I am developing an app that focuses on manipulating strings in various ways. One feature removes any duplicate items while combining files. I have attempted to use this:
Private Sub run()
    For Each filePath As String In ListBox1.Items
        For Each Line In IO.File.ReadAllLines(filePath)
            Dim founds() As String = Line.Split(":")
            Dim hash As String = founds(0)
            Dim word As String = founds(1)
            foundList.Add(word)
            Dim result As List(Of String) = foundList.Distinct().ToList
            Label1.Text = result.Count
            For Each addstring In result
                ListBox2.Items.Add(addstring)
            Next
        Next
    Next
End Sub
Distinct was very slow in this fashion, so I tried using:
Private Sub run()
    For Each filePath As String In ListBox1.Items
        For Each Line In IO.File.ReadAllLines(filePath)
            Dim founds() As String = Line.Split(":")
            Dim hash As String = founds(0)
            Dim word As String = founds(1)
            If Not foundList.Contains(word) Then
                foundList.Add(word)
                Label1.Text = foundList.Count
            End If
        Next
    Next
    For Each found In foundList
        ListBox2.Items.Add(found)
    Next
End Sub
This was much faster, but it still performs more slowly than should be possible without resorting to OpenCL or similar. I can write in C# if there is something different available there, but this is in .NET. Can anyone suggest a faster or more effective method? This can't possibly be it; surely I am missing something.
Upvotes: 2
Views: 249
Reputation: 32443
Use HashSet(Of String)
Dim lines = IO.File.ReadAllLines(filePath)
Dim uniqueLines = New HashSet(Of String)(lines)
After initialization the HashSet will contain only the unique values. You can also use the HashSet(Of T).Add(value) method, which returns True if the value was added to the set and False if the value already exists in it:
Dim isAdded As Boolean = uniqueLines.Add("someValue")

If isAdded Then
    ' Do something if the value was added
Else
    ' Do something if it already exists
End If
HashSet also has a Contains method whose algorithm is O(1) - it takes a fixed number of operations regardless of the set's size - whereas, for example, List.Contains iterates the whole list until it finds the given value (O(N) - in the worst case the number of operations equals the number of items).
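A rough illustration of that difference (the variable names and sample words here are made up for the example, not taken from the question):

' Membership checks while de-duplicating words.
Dim fastSeen As New HashSet(Of String)()   ' Contains is O(1)
Dim slowSeen As New List(Of String)()      ' Contains scans the whole list, O(N)

For Each word In {"alpha", "beta", "alpha"}
    If Not fastSeen.Contains(word) Then fastSeen.Add(word)   ' constant-time check
    If Not slowSeen.Contains(word) Then slowSeen.Add(word)   ' gets slower as the list grows
Next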
So your function can be re-written as below
Private Sub run()
    ' get data
    Dim allItems = ListBox1.Items.Cast(Of String)().
        SelectMany(Function(path) IO.File.ReadAllLines(path)).
        SelectMany(Function(line) line.Split(":"c))
    Dim uniqueItems = New HashSet(Of String)(allItems)

    ' update controls
    Label1.Text = uniqueItems.Count.ToString()
    ListBox2.Items.AddRange(uniqueItems.ToArray())
End Sub
Notice that the items are added to ListBox2 with the .AddRange method. It adds all of the items in one operation and re-draws the control only once, whereas adding items one by one with .Add forces the control to re-draw itself for every added item, which can be "heavy" for a large number of items.
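If you ever do have to add items one at a time, the ListBox control also exposes BeginUpdate/EndUpdate to suspend re-drawing while you add; a small sketch (the single AddRange call above remains the simpler option):

ListBox2.BeginUpdate()
Try
    For Each item In uniqueItems
        ListBox2.Items.Add(item)
    Next
Finally
    ListBox2.EndUpdate()   ' the control is repainted once, here
End Try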
Upvotes: 1
Reputation: 16584
Fabio got to the obvious answer before I finished this. I've put some more detail in, but skip to the bottom for another idea.
The obvious speed issue is in the string comparison:
If Not foundList.Contains(word) Then
Comparing strings is a fairly expensive operation, and here you're comparing strings against a successively larger list of other strings. For a short list that might be OK, but when you're dealing with big lists it's going to slow down somewhat.
The better option is to hash each string once then compare the hashes. There's some finesse required to handle hash collisions, but when the bulk of the data is unique and the hash function is good then you'll get a huge improvement in the speed.
In .NET the HashSet(Of T) class implements hash-based storage and lookup. Being a set, HashSet will only hold one of any particular value, which handles the duplication issue. It makes no guarantees about retaining the order of its contents, but in practice, if you are only adding and never removing items, the order is preserved. Lookups in a HashSet are very fast due to the way the hash values are stored and indexed internally. Testing whether a value exists in the set is almost unaffected by the number of items: I get lookup times on the order of ~50ns in sets of 1,000 to 1,000,000 strings with a simple test (100,000,000 lookups).
For your purposes the usage would be something like:
Private Sub Run()
    ' shortcut the file reads...
    Dim items = ListBox1.Items.OfType(Of String)().
        SelectMany(Function(fn) IO.File.ReadAllLines(fn)).
        Select(Function(i) i.Split(":"c)(1))
    Dim hash = New HashSet(Of String)(items)

    ListBox2.Items.Clear()
    ListBox2.Items.AddRange(hash.ToArray())
End Sub
(Sorry, VB is not my native language.)
Question is, will this actually speed things up? You'll need to do some testing of your own, but I suspect that the answer might be: not much.
Yes, it's faster than using Distinct and ToArray to get an array of distinct values. Almost twice as fast by my simple test: ~180ms vs ~275ms for a million distinct 36-character strings (yes, they're Guids) in an array. Not much of a saving in absolute terms. YMMV, but if the operation is taking significantly more time than that then the Distinct is probably not your biggest problem.
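If you want to reproduce that kind of measurement, a rough Stopwatch sketch along these lines should do (the GUID test data is made up for the benchmark; your real input will differ):

' Rough timing sketch - one million distinct GUID strings.
' Assumes the default VB project-level imports (System.Linq, System.Diagnostics).
Dim data = Enumerable.Range(0, 1000000).
    Select(Function(i) Guid.NewGuid().ToString()).
    ToArray()

Dim sw = Stopwatch.StartNew()
Dim viaDistinct = data.Distinct().ToArray()
Console.WriteLine("Distinct: " & sw.ElapsedMilliseconds & " ms")

sw.Restart()
Dim viaHashSet = New HashSet(Of String)(data).ToArray()
Console.WriteLine("HashSet:  " & sw.ElapsedMilliseconds & " ms")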
Do some profiling and find the actual pain point. I suspect that you'll find that ListBox2 has the Sorted flag set. Try this:
Private Sub Run()
    Dim items = ListBox1.Items.OfType(Of String)().
        SelectMany(Function(fn) IO.File.ReadAllLines(fn)).
        Select(Function(i) i.Split(":"c)(1))
    Dim hash = New HashSet(Of String)(items)

    ListBox2.Items.Clear()

    ' turn sort-on-insert off while adding, then restore the original setting
    Dim sorted = ListBox2.Sorted
    ListBox2.Sorted = False
    ListBox2.Items.AddRange(hash.ToArray())
    ListBox2.Sorted = sorted
End Sub
If that's a lot faster then the problem isn't in the Distinct, it's in the sort-on-insert, which is painfully slow and almost always the worst option for sorted lists.
If it's not, then the problem lies somewhere else and you'll need to keep profiling.
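If sort-on-insert does turn out to be the culprit and you still want the list sorted, sorting once before the AddRange is usually far cheaper; a sketch reusing the hash variable from the snippet above:

ListBox2.Sorted = False                                          ' keep sort-on-insert off
ListBox2.Items.Clear()
ListBox2.Items.AddRange(hash.OrderBy(Function(s) s).ToArray())   ' sort once, add once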
Upvotes: 1
Reputation: 22886
Get all the values before using Distinct and adding them (the slowest part is still updating the control):
ListBox2.DataSource = (From line In IO.File.ReadLines(filePath)
                       Select line.Split(":"c)(1) Distinct).ToList
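The query above handles a single filePath; a sketch of the same idea extended over every file in ListBox1 (an illustration, not part of the original answer):

ListBox2.DataSource =
    (From path In ListBox1.Items.Cast(Of String)()
     From line In IO.File.ReadLines(path)
     Select line.Split(":"c)(1) Distinct).ToList()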
Upvotes: 1