Reputation: 9591
I have a powershell script which uses Compare-Object
to diff/compare a list of MD5 checksum's against each-other ... how can I speed this up? its been running for hours!
$diffmd5 =
(Compare-Object -ReferenceObject $localmd5 -DifferenceObject $remotefilehash |
Where-Object { ($_.SideIndicator -eq '=>') } |
Select-Object -ExpandProperty InputObject)
Upvotes: 2
Views: 3578
Reputation: 439193
Compare-Object
is convenient, but indeed slow; also avoiding the pipeline altogether is important for maximizing performance.
I suggest using a [System.Collections.Generic.HashSet[T]
instance, which supports high-performance lookups in a set of unordered[1] values:[2]
# Two sample arrays
$localmd5 = 'Foo1', 'Bar1', 'Baz1'
$remotefilehash = 'foo1', 'bar1', 'bar2', 'baz1', 'more'
# Create a hash set from the local hashes.
# Make lookups case-*insensitive*.
# Note: Strongly typing the input array ([string[]]) is a must.
$localHashSet = [System.Collections.Generic.HashSet[string]]::new(
[string[]] $localmd5,
[System.StringComparer]::OrdinalIgnoreCase
)
# Loop over all remote hashes to find those not among the local hashes.
$remotefilehash.Where({ -not $localHashSet.Contains($_) })
The above yields collection 'bar2', 'more'
.
Note that if case-sensitive lookups are sufficient, which is the default (for string elements), a simple cast is sufficient to construct the hash set:
$localHashSet = [System.Collections.Generic.HashSet[string]] $localmd5
Note: Your later feedback states that $remotefilehash
is a hashtable(-like) collection of key-value pairs rather than a collection of mere file-hash strings, in which the keys store the hash strings. In that case:
To find just the differing hash strings (note the .Keys
property access to get the array of key values):
$remotefilehash.Keys.Where({ -not $localHashSet.Contains($_) })
To find those key-value pairs whose keys are not in the hash set (note the .GetEnumerator()
call to enumerate all entries (key-value pairs)):
$remotefilehash.GetEnumerator().Where({ -not $localHashSet.Contains($_.Key) })
Alternatively, if the input collections are (a) of the same size and (b) have corresponding elements (that is, element 1 from one collection should be compared to element 1 from the other, and so on), using Compare-Object
with -SyncWindow 0
, as shown in js2010's helpful answer, with subsequent .SideIndicator
filtering may be an option; to speed up the operation, the -PassThru
switch should be used, which forgoes wrapping the differing objects in [pscustomobject]
instances (the .SideIndicator
property is then added as a NoteProperty
member directly to the differing objects).
[1] There is a related type for maintaining sorted values, System.Collections.Generic.SortedSet[T]
, but - as of .NET 6 - no built-in type for maintaining values in input order, though you can create your own type by deriving from [System.Collections.ObjectModel.KeyedCollection[TKey, TItem]]
[2] Note that a hash set - unlike a hash table - has no values associated with its entries. A hash set is "all keys", if you will - all it supports is testing for the presence of a key == value.
Upvotes: 5
Reputation: 27491
By default, compare-object compares every element in the first array with every element in the second array (up to about 2 billion positions), so the order doesn't matter, but large lists would be very slow. -syncwindow 0 would be much faster but would require matches to be in the same exact positions:
Compare-Object $localmd5 $remotefilehash -syncwindow 0
As a simple demo of -syncwindow:
compare-object 1,2,3 1,3,2 -SyncWindow 0 # not equal
InputObject SideIndicator
----------- -------------
3 =>
2 <=
2 =>
3 <=
compare-object 1,2,3 1,3,2 -SyncWindow 1 # equal
compare-object 1,2,3 1,2,3 -SyncWindow 0 # equal
Upvotes: 2
Reputation: 60518
I feel this should be faster than Compare-Object
$result = [system.collections.generic.list[string]]::new()
foreach($hash in $remotefilehash)
{
if(-not($localmd5.contains($hash)))
{
$result.Add($hash)
}
}
The problem here is that .contains
method is case sensitive, I believe all MD5 hashes have uppercase letters but if this was not the case you would need to call the .toupper()
or .tolower()
methods to normalize the arrays.
Upvotes: 0