Reputation: 183
Is there a way to speed up this Where-Object
comparison? It's OK with 1000 records in both files, but when the files have more than 500k records each it's really slow.
function progressBar ($i, $totalItems) {
    Write-Progress -Activity "My Progress Bar" -Status "Doing stuff on item $i" -PercentComplete ($i / $totalItems * 100)
}
$PSDefaultParameterValues['*:Encoding'] = 'utf8'
$f1 = (Import-Csv 'A.txt' -Delimiter 'µ' -Header samname, id)
$f2 = (Import-Csv 'B.txt' -Delimiter 'µ' -Header samname, id)
$counter = 0
$totalItems = $f1.Count
$f1 | ForEach-Object {
    $samname = $_.samname
    if ($m = $f2 | Where-Object { $_.samname -eq $samname }) {
        $_.id = $m.id
    } else {
        $_.id = $_.id
    }
    $counter += 1
    #Start-Sleep -s 3
    progressBar -i $counter -totalItems $totalItems
}
$f1 | Export-Csv 'D.txt' -NoType
Upvotes: 1
Views: 1872
Reputation: 23830
As you are bypassing the PowerShell pipeline anyway, I would consider using the .Where({...})
(PSv4+) method, which uses a lot of memory but is often faster.
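A minimal sketch of what that could look like inside the loop body ($f2 and $samname as in the question's code; the 'First' selection mode stops the scan at the first match):

```powershell
# .Where() with the 'First' mode (PSv4+) returns a collection of at most one element
$m = $f2.Where({ $_.samname -eq $samname }, 'First')
if ($m) { $_.id = $m[0].id }
```

This avoids scanning the rest of $f2 once a match is found, but it is still a linear scan per record, so the hashtable approaches below scale much better.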
If you really want joining objects to be fast, you might consider using LINQ,
as in the https://github.com/ili101/Join-Object example from RamblingCookieMonster.
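A hedged sketch of what calling that Join-Object function might look like, assuming it has been installed or dot-sourced; the parameter names below (-Left, -Right, -LeftJoinProperty, -RightJoinProperty, -Type) are taken from that repository's examples and may differ between versions:

```powershell
# assumed Join-Object signature: left join $f1 against $f2 on samname
$joined = Join-Object -Left $f1 -Right $f2 `
    -LeftJoinProperty samname -RightJoinProperty samname -Type AllInLeft
```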
Upvotes: 0
Reputation: 9143
As Mathias R. Jessen said, you are iterating over the second recordset once for every record in the first set. That should be replaced with a faster algorithm: a hash join, or a merge join if the recordsets are sorted. Second, Write-Progress
is very slow, so try not to update the progress on every iteration. You can try this tuned version:
#demo data
(1..100000 | % { "Name_$($_)µ$($_)" }) -join "`n" | Out-File A.txt
(1..100000 | % { "Name_$($_)µ$($_)" }) -join "`n" | Out-File B.txt
$f1 = (Import-Csv 'A.txt' -Delimiter 'µ' -Header samname, id)
$f2 = (Import-Csv 'B.txt' -Delimiter 'µ' -Header samname, id)
$counter = 0
$totalItems = $f1.Count
#build a lookup table: samname -> id
$dict = @{}
$f2 | % { $dict[$_.samname] = $_.id }
$f1 | % {
    if ($dict.ContainsKey($_.samname)) {
        #the hashtable stores the id string directly, so no extra .id lookup
        $_.id = $dict[$_.samname]
    }
    #update progress every 100 iterations
    if ($counter++ % 100 -eq 0) { progressBar -i $counter -totalItems $totalItems }
}
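The merge join mentioned above, for the case where both recordsets are sorted (or can be sorted once) by samname, might look roughly like this sketch. It assumes samname values are unique in each file; duplicate keys would need extra handling:

```powershell
# merge join: advance two cursors through recordsets sorted on samname;
# PowerShell's -lt/-gt string comparison matches Sort-Object's default ordering
$a = @($f1 | Sort-Object samname)
$b = @($f2 | Sort-Object samname)
$i = 0; $j = 0
while ($i -lt $a.Count -and $j -lt $b.Count) {
    if ($a[$i].samname -lt $b[$j].samname)     { $i++ }
    elseif ($a[$i].samname -gt $b[$j].samname) { $j++ }
    else {
        $a[$i].id = $b[$j].id   # keys match: copy the id across
        $i++; $j++
    }
}
```

For unsorted input the hash join above is usually the better choice, since it avoids the two sorts.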
Upvotes: 2
Reputation: 24
If you're trying to get the objects between two CSV files that have the same samname property, you may want to consider using Compare-Object, like so:
Compare-Object -ReferenceObject $f1 -DifferenceObject $f2 -Property samname -IncludeEqual
If you want to work with those objects further down the pipeline, pass them through with -PassThru.
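For example, to export only the rows whose samname appears in both files ('Common.txt' is a hypothetical output name):

```powershell
# -ExcludeDifferent keeps only the matching rows; -PassThru emits the actual objects
Compare-Object -ReferenceObject $f1 -DifferenceObject $f2 -Property samname -IncludeEqual -ExcludeDifferent -PassThru |
    Export-Csv 'Common.txt' -NoTypeInformation
```

Note that Compare-Object also compares the full recordsets pairwise internally, so for 500k-row files it may not be faster than the hashtable approaches.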
Upvotes: 0
Reputation: 174990
With the Where-Object clause inside the ForEach-Object statement, you're effectively creating a nested loop, meaning that the script is doing the comparison 250 billion times.
What I usually do is convert one of the arrays to a hashtable where the key is the value of the property you wish to compare, and then use that as a "lookup table" if you will.
Accessing a hashtable by key is super fast compared to iterating over the full array, so it'll be much faster than your current approach:
$f1 = (Import-Csv 'A.txt' -Delimiter 'µ' -Header samname, id)
$f2 = (Import-Csv 'B.txt' -Delimiter 'µ' -Header samname, id)
$counter = 0
$totalItems = $f1.Count
#index $f2 by samname for constant-time lookups
$h2 = @{}
$f2 | ForEach-Object {
    $h2[$_.samname] = $_
}
$f1 | ForEach-Object {
    $samname = $_.samname
    if ($h2.ContainsKey($samname)) {
        $_.id = $h2[$samname].id
    }
    $counter += 1
    #Start-Sleep -s 3
    progressBar -i $counter -totalItems $totalItems
}
Upvotes: 6