MarcGel

Reputation: 301

PowerShell Slowness with Two Large CSV Files

I have this script working, but with 100k+ rows in File1 and 200k+ in File2 it will take days to complete. I got the .Where({ lookup down to less than a second by loading both CSV files as data tables, but with that route I can't get the data out the way I want. The script below outputs the data the way I want, but it takes about 4 seconds per lookup. What can I do to speed this up?

I thought ContainsKey might help somewhere, but PRACT_ID has a one-to-many relationship, so I'm not sure how to handle that. Thanks.

Invoke-Expression "C:\SHC\MSO\DataTable\functionlibrary.ps1" 
[System.Data.DataTable]$Script:MappingTable = New-Object System.Data.DataTable
$File1 = Import-csv "C:\File1.csv" -Delimiter '|' | Sort-Object PRACT_ID
$File2 = Get-Content "C:\File2.csv" | Select-Object -Skip 1 | Sort-Object 
$Script:MappingTable = $File1 | Out-DataTable

$Logs = "C:\Testing1.7.csv" 

[System.Object]$UserOutput = @()

foreach ($name in $File1) {

    [string]$userMatch = $File2.Where( { $_.Split("|")[0] -eq $name.PRACT_ID })

    if ($userMatch) {
        # Process the data

        $UserOutput += New-Object PsObject -property @{
            ID_NUMBER                = $name.ID_NUMBER
            PRACT_ID                 = $name.PRACT_ID
            LAST_NAME                = $name.LAST_NAME
            FIRST_NAME               = $name.FIRST_NAME
            MIDDLE_INITIAL           = $name.MIDDLE_INITIAL
            DEGREE                   = $name.DEGREE
            EMAILADDRESS             = $name.EMAILADDRESS
            PRIMARY_CLINIC_PHONE     = $name.PRIMARY_CLINIC_PHONE
            SPECIALTY_NAME           = $name.SPECIALTY_NAME
            State_License            = $name.State_License
            NPI_Number               = $name.NPI_Number
            'University Affiliation' = $name.'University Affiliation'
            Teaching_Title           = $name.Teaching_Title
            FACILITY                 = $userMatch
        }
    }
}


$UserOutput | Select-Object ID_NUMBER, PRACT_ID, LAST_NAME, FIRST_NAME, MIDDLE_INITIAL, DEGREE, EMAILADDRESS, PRIMARY_CLINIC_PHONE, SPECIALTY_NAME, State_License, NPI_Number, 'University Affiliation', Teaching_Title, FACILITY |
Export-Csv $logs -NoTypeInformation 

Upvotes: 0

Views: 695

Answers (2)

thepip3r

Reputation: 2935

There are a multitude of ways to increase the speed of the operations you're doing, and they can be broken down into in-script and out-of-script possibilities:

Out-of-Script Possibilities:

Since the files are large, how much memory does the machine you're running this on have, and are you maxing it out during this operation?

If you are paging to disk, this will be the single biggest impact on the overall process!
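
A quick way to check is a snapshot like the rough sketch below, using the built-in Get-CimInstance and Get-Counter cmdlets (the counter path assumes an English-language Windows install):

# Snapshot of total/free memory and hard-page activity while the script runs
$os = Get-CimInstance Win32_OperatingSystem
[pscustomobject]@{
    TotalMemoryGB = [math]::Round($os.TotalVisibleMemorySize / 1MB, 1)   # WMI reports these values in KB
    FreeMemoryGB  = [math]::Round($os.FreePhysicalMemory / 1MB, 1)
    PagesPerSec   = (Get-Counter '\Memory\Pages/sec').CounterSamples[0].CookedValue
}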

If you are, the two ways to address this are:

  1. Throw more hardware at the problem (easiest to deal with)
  2. Write your code to iterate over each file a small chunk at a time so you don't load the whole thing into RAM at once (harder if you're not familiar with the pattern; see the sketch after this list).
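
Here is a minimal sketch of that chunked approach, assuming File2 is the pipe-delimited file from the question and PowerShell 5+ for the ::new() syntax; a plain .NET StreamReader keeps only one line in memory at a time:

# Stream File2 one line at a time instead of loading the whole file into RAM
$reader = [System.IO.StreamReader]::new('C:\File2.csv')
try {
    $null = $reader.ReadLine()            # skip the header row
    while (-not $reader.EndOfStream) {
        $line = $reader.ReadLine()
        $practId = $line.Split('|')[0]    # first pipe-delimited column is PRACT_ID
        # ... process or accumulate only what you need from this row ...
    }
}
finally {
    $reader.Dispose()
}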

In-Script:

  • Don't use @() with +=; it's really slow, especially over large datasets.

Use an ArrayList instead. Here is a quick sample of the performance difference (ArrayList is roughly 40x faster on 10,000 entries and 500x faster on 100,000 entries, consistently; the gap widens as the dataset grows, or in other words, @() with += gets slower as the dataset gets bigger):

(Measure-Command {
    $arr = [System.Collections.ArrayList]::new()

    1..100000 | % {
        [void]$arr.Add($_)
    }
}).TotalSeconds

(Measure-Command {
    $arr = @()
    1..100000 | % {
        $arr += $_
    }
}).TotalSeconds

0.8258113
451.5413987
  • If you need to do multiple key-based lookups on the data, iterating over the arrays millions of times will be slow. Import the data as CSV, then build one or two hashtables keyed on the lookup value (key -> data and/or key -> data[]); hashtable index lookups are MUCH faster than iterating through the arrays millions of times, assuming you have the RAM available for the extra objects. A sketch of the key -> data[] shape follows below.
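
For example, a minimal sketch of the key -> data[] shape, assuming (as in the question) that the first pipe-delimited column of File2 is PRACT_ID and that one PRACT_ID can map to several facility rows:

# Build a hashtable of PRACT_ID -> list of matching File2 lines (handles the one-to-many case)
$facilitiesById = @{}
foreach ($line in (Get-Content 'C:\File2.csv' | Select-Object -Skip 1)) {
    $key = $line.Split('|')[0]
    if (-not $facilitiesById.ContainsKey($key)) {
        $facilitiesById[$key] = [System.Collections.Generic.List[string]]::new()
    }
    $facilitiesById[$key].Add($line)
}

# O(1) lookup instead of scanning 200k+ rows for every practitioner:
# $facilitiesById['12345'] returns every facility line for that PRACT_ID (example key)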

EDIT for @RoadRunner:

My experience with Get-Content (GC) may be out of date... it used to be horrendously slow on large files, but in newer PowerShell versions that appears to have been fixed:

[System.IO.File]::WriteAllLines("$($Env:UserProfile)\Desktop\10MB.txt", ('8' * 10MB))
[System.IO.File]::WriteAllLines("$($Env:UserProfile)\Desktop\50MB.txt", ('8' * 50MB))
[System.IO.File]::WriteAllLines("$($Env:UserProfile)\Desktop\100MB.txt", ('8' * 100MB))
[System.IO.File]::WriteAllLines("$($Env:UserProfile)\Desktop\500MB.txt", ('8' * 500MB))

$10MB  = gi "$($Env:UserProfile)\Desktop\10MB.txt"
$50MB  = gi "$($Env:UserProfile)\Desktop\50MB.txt"
$100MB = gi "$($Env:UserProfile)\Desktop\100MB.txt"
$500MB = gi "$($Env:UserProfile)\Desktop\500MB.txt"

0..10 | % { 
    $n = [pscustomobject] @{
        'GC_10MB'    = (Measure-Command { Get-Content $10MB }).TotalSeconds
        'RAL_10MB'   = (Measure-Command { [System.IO.File]::ReadAllLines($10MB) }).TotalSeconds
        'GC_50MB'    = (Measure-Command { Get-Content $50MB }).TotalSeconds
        'RAL_50MB'   = (Measure-Command { [System.IO.File]::ReadAllLines($50MB) }).TotalSeconds
        'GC_100MB'   = (Measure-Command { Get-Content $100MB }).TotalSeconds
        'RAL_100MB'  = (Measure-Command { [System.IO.File]::ReadAllLines($100MB) }).TotalSeconds
        'GC_500MB'   = (Measure-Command { Get-Content $500MB }).TotalSeconds
        'RAL_500MB'  = (Measure-Command { [System.IO.File]::ReadAllLines($500MB) }).TotalSeconds
        'Delta_10MB'  = $null
        'Delta_50MB'  = $null
        'Delta_100MB' = $null
        'Delta_500MB' = $null
    }

    $n.Delta_10MB  = "{0:P}" -f ($n.GC_10MB / $n.RAL_10MB)
    $n.Delta_50MB  = "{0:P}" -f ($n.GC_50MB / $n.RAL_50MB)
    $n.Delta_100MB = "{0:P}" -f ($n.GC_100MB / $n.RAL_100MB)
    $n.Delta_500MB = "{0:P}" -f ($n.GC_500MB / $n.RAL_500MB)

    $n
}

Upvotes: 2

Mathias R. Jessen

Reputation: 174990

Load $File2 into a hashtable with the $_.Split('|')[0] value as the key; you can then also skip the object creation completely and offload everything to Select-Object:

$File2 = Get-Content "C:\File2.csv" | Select-Object -Skip 1 | Sort-Object 

# load $file2 into hashtable
$userTable = @{}
foreach($userEntry in $File2){
    $userTable[$userEntry.Split('|')[0]] = $userEntry
}

# prepare the existing property names we want to preserve
$propertiesToSelect = 'ID_NUMBER', 'PRACT_ID', 'LAST_NAME', 'FIRST_NAME', 'MIDDLE_INITIAL', 'DEGREE', 'EMAILADDRESS', 'PRIMARY_CLINIC_PHONE', 'SPECIALTY_NAME', 'State_License', 'NPI_Number', 'University Affiliation', 'Teaching_Title'

# read file, filter on existence in $userTable, add the FACILITY calculated property before export
Import-Csv "C:\File1.csv" -Delimiter '|' |
    Where-Object { $userTable.ContainsKey($_.PRACT_ID) } |
    Select-Object $propertiesToSelect, @{Name = 'FACILITY'; Expression = { $userTable[$_.PRACT_ID] }} |
    Export-Csv $logs -NoTypeInformation

Upvotes: 2
