Petras K

Reputation: 183

Speed up Where-Object comparison with CSV files

Is there a way to speed up a Where-Object comparison? It's OK with 1,000 records in both files, but when the files have more than 500k records it's really slow.

function progressBar ($i, $totalItems) {
    # report progress as a percentage of the total item count
    Write-Progress -Activity "My Progress Bar" -Status "Processing item $i of $totalItems" -PercentComplete ($i / $totalItems * 100)
}

$PSDefaultParameterValues['*:Encoding'] = 'utf8'

$f1 = (Import-Csv 'A.txt' -Delimiter 'µ' -Header samname, id)
$f2 = (Import-Csv 'B.txt' -Delimiter 'µ' -Header samname, id)

$counter = 0
$totalItems = $f1.Count

$f1 | ForEach-Object {
    $samname = $_.samname
    # Where-Object scans all of $f2 for every row of $f1 - this nested loop is the bottleneck
    if ($m = $f2 | Where-Object {$_.samname -eq $samname}) {
        $_.id = $m.id
    }
    # rows without a match keep their original id

    $counter += 1
    progressbar -i $counter -totalItems $totalItems
}
$f1 | Export-Csv 'D.txt' -NoTypeInformation

Upvotes: 1

Views: 1872

Answers (4)

iRon

Reputation: 23830

As you are working against the PowerShell pipeline technique anyway, I would consider using the .Where({...}) method (PSv4+), which eats a lot of memory but is often faster.
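
A minimal sketch of what that could look like, reusing $f1 and $f2 from the question (the 'First' selection mode, which stops scanning at the first match, is my addition; this is still a full scan per row, just with less pipeline overhead):

foreach ($row in $f1) {
    $samname = $row.samname
    # the .Where() method returns a collection; 'First' stops after one match
    $m = $f2.Where({ $_.samname -eq $samname }, 'First')
    if ($m) { $row.id = $m[0].id }
}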

If you really want joining objects to be fast, you might consider using LINQ, as in the https://github.com/ili101/Join-Object example from RamblingCookieMonster.
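
A hypothetical invocation, assuming the Join-Object function from that repository is installed (the parameter names below are my reading of its interface; check the repository's README for the exact signature):

Join-Object -Left $f1 -Right $f2 -LeftJoinProperty samname -RightJoinProperty samname |
    Export-Csv 'D.txt' -NoTypeInformation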

Upvotes: 0

Paweł Dyl

Reputation: 9143

As Mathias R. Jessen said, you are iterating over the second recordset once for every record in the first set. That should be replaced by a faster algorithm: a hash join, or a merge join if the recordsets are sorted (a merge-join sketch follows the code below). Second, Write-Progress is very slow, so try not to update progress on every iteration. You can try this tuned version:

#demo data
(1..100000 | % { "Name_$($_)µ$($_)" }) -join "`n" | out-file A.txt
(1..100000 | % { "Name_$($_)µ$($_)" }) -join "`n" | out-file B.txt

$f1 = (import-csv 'A.txt' -Delimiter 'µ' -header samname,id)
$f2 = (import-csv 'B.txt' -Delimiter 'µ' -header samname,id)

# build a hashtable keyed on samname - one pass over $f2
$dict = @{}
$f2 | % { $dict[$_.samname] = $_.id }

$counter = 0
$totalItems = $f1.Count

$f1 | % {
  if ($dict.ContainsKey($_.samname)) {
    # the hashtable stores the id string directly
    $_.id = $dict[$_.samname]
  }
  # update the progress bar only every 100 iterations
  if ($counter++ % 100 -eq 0) { progressbar -i $counter -totalItems $totalItems }
}
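
The merge-join variant mentioned above would look roughly like this (a sketch on my part; it assumes both recordsets are already sorted ascending by samname, e.g. via Sort-Object):

$i = $j = 0
while ($i -lt $f1.Count -and $j -lt $f2.Count) {
    $cmp = [string]::CompareOrdinal($f1[$i].samname, $f2[$j].samname)
    if ($cmp -lt 0)     { $i++ }    # left row has no match, advance left side
    elseif ($cmp -gt 0) { $j++ }    # advance right side
    else {
        $f1[$i].id = $f2[$j].id     # match: copy the id, advance left side
        $i++
    }
}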

Upvotes: 2

BlitzThunderWolf93

Reputation: 24

If you're trying to get the objects between two CSV files that have the same "samname" property, you may want to consider using Compare-Object, like so:

Compare-Object -ReferenceObject $f1 -DifferenceObject $f2 -Property samname -IncludeEqual

If you want to work with those objects further down the pipeline, pass them through with -PassThru.
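
For example, a sketch (my own, reusing $f1 and $f2 from the question) that keeps only the rows whose samname appears in both files:

# -PassThru returns the original objects with a SideIndicator property added;
# -IncludeEqual -ExcludeDifferent limits the output to the matching rows
$matched = Compare-Object -ReferenceObject $f1 -DifferenceObject $f2 -Property samname -IncludeEqual -ExcludeDifferent -PassThru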

Upvotes: 0

Mathias R. Jessen

Reputation: 174990

With the Where-Object clause inside the ForEach-Object statement, you're effectively creating a nested loop, meaning that with 500k records in each file the script is doing up to 250 billion (500,000 × 500,000) comparisons.

What I usually do is convert one of the arrays to a hashtable where the key is the value of the property you wish to compare, and then use that as a "lookup table" if you will.

Accessing a hashtable by key is super fast compared to iterating over the full array, so it'll be much faster than your current approach:

$f1 = (Import-Csv 'A.txt' -Delimiter 'µ' -Header samname,id)
$f2 = (Import-Csv 'B.txt' -Delimiter 'µ' -Header samname,id)

# build the lookup table: key = samname, value = the whole row from $f2
$h2 = @{}
$f2 | ForEach-Object {
  $h2[$_.samname] = $_
}

$counter = 0
$totalItems = $f1.Count

$f1 | ForEach-Object {
  $samname = $_.samname
  if ($h2.ContainsKey($samname)) {
    # single hashtable lookup instead of scanning all of $f2
    $_.id = $h2[$samname].id
  }

  $counter += 1
  progressbar -i $counter -totalItems $totalItems
}
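
To produce the same output file as in the question, finish with the original export step:

$f1 | Export-Csv 'D.txt' -NoTypeInformation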

Upvotes: 6
