I have this script working, but with 100k+ rows in File1 and 200k+ in File2 it will take days to complete. I got the Where({ }) part down to less than a second with both CSV files loaded as DataTables, but going that route I can't get the data out the way I want. The script below outputs the data the way I want, but it takes 4 seconds per lookup. What can I do to speed this up?
I thought ContainsKey might help somewhere, but PRACT_ID has a one-to-many relationship, so I'm not sure how to handle those. Thanks.
Invoke-Expression "C:\SHC\MSO\DataTable\functionlibrary.ps1"
[System.Data.DataTable]$Script:MappingTable = New-Object System.Data.DataTable
$File1 = Import-csv "C:\File1.csv" -Delimiter '|' | Sort-Object PRACT_ID
$File2 = Get-Content "C:\File2.csv" | Select-Object -Skip 1 | Sort-Object
$Script:MappingTable = $File1 | Out-DataTable
$Logs = "C:\Testing1.7.csv"
[System.Object]$UserOutput = @()
foreach ($name in $File1) {
    [string]$userMatch = $File2.Where( { $_.Split("|")[0] -eq $name.PRACT_ID })
    if ($userMatch) {
        # Process the data
        $UserOutput += New-Object PsObject -Property @{
            ID_NUMBER            = $name.ID_NUMBER
            PRACT_ID             = $name.PRACT_ID
            LAST_NAME            = $name.LAST_NAME
            FIRST_NAME           = $name.FIRST_NAME
            MIDDLE_INITIAL       = $name.MIDDLE_INITIAL
            DEGREE               = $name.DEGREE
            EMAILADDRESS         = $name.EMAILADDRESS
            PRIMARY_CLINIC_PHONE = $name.PRIMARY_CLINIC_PHONE
            SPECIALTY_NAME       = $name.SPECIALTY_NAME
            State_License        = $name.State_License
            NPI_Number           = $name.NPI_Number
            'University Affiliation' = $name.'University Affiliation'
            Teaching_Title       = $name.Teaching_Title
            FACILITY             = $userMatch
        }
    }
}
$UserOutput | Select-Object ID_NUMBER, PRACT_ID, LAST_NAME, FIRST_NAME, MIDDLE_INITIAL, DEGREE, EMAILADDRESS, PRIMARY_CLINIC_PHONE, SPECIALTY_NAME, State_License, NPI_Number, 'University Affiliation', Teaching_Title, FACILITY |
Export-Csv $logs -NoTypeInformation
There are a multitude of ways to speed up the operations you're doing, and they can be broken down into in-script and out-of-script possibilities:
Out of Script Possibilities:
Since the files are large, how much memory does the machine you're running this on have? And are you maxing it out during this operation?
If you are paging to disk, this will be the single biggest impact to the overall process!
If you are, the two main ways to address this are adding more memory to the machine or reducing how much of the data you hold in memory at once (for example, processing it in smaller chunks).
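A quick way to check (a sketch, not part of the original answer; the properties used are standard Win32_OperatingSystem and Get-Process members) is to compare the PowerShell process's working set to the machine's physical memory while the script is running:
# Snapshot of physical memory vs. this PowerShell process while the script runs
$os   = Get-CimInstance Win32_OperatingSystem
$proc = Get-Process -Id $PID

[pscustomobject]@{
    TotalRAM_GB   = [math]::Round($os.TotalVisibleMemorySize / 1MB, 2)  # reported in KB
    FreeRAM_GB    = [math]::Round($os.FreePhysicalMemory / 1MB, 2)      # reported in KB
    ScriptProc_GB = [math]::Round($proc.WorkingSet64 / 1GB, 2)          # reported in bytes
}
If free RAM is near zero while the script's working set keeps growing, you are almost certainly paging.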
In-Script:
Your script builds the output with @() and +=, which copies the entire array on every append. Use an ArrayList instead. Here is a quick sample of the perf difference (ArrayList is ~40x faster on 10,000 entries and ~500x faster on 100,000 entries, consistently -- the gap widens as the dataset grows; in other words, @() += gets slower as the dataset gets bigger):
(Measure-Command {
    $arr = [System.Collections.ArrayList]::new()
    1..100000 | % {
        [void]$arr.Add($_)
    }
}).TotalSeconds

(Measure-Command {
    $arr = @()
    1..100000 | % {
        $arr += $_
    }
}).TotalSeconds

0.8258113
451.5413987
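Applied to the script in the question, the only change is how $UserOutput is created and appended to (a sketch; the property list is abbreviated to keep it short):
# Collect results in an ArrayList instead of growing an array with +=
$UserOutput = [System.Collections.ArrayList]::new()

foreach ($name in $File1) {
    [string]$userMatch = $File2.Where( { $_.Split("|")[0] -eq $name.PRACT_ID })
    if ($userMatch) {
        # [void] suppresses the index that Add() returns
        [void]$UserOutput.Add((New-Object PsObject -Property @{
            ID_NUMBER = $name.ID_NUMBER
            PRACT_ID  = $name.PRACT_ID
            # ... remaining properties exactly as in the original script ...
            FACILITY  = $userMatch
        }))
    }
}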
The other big win is to stop scanning $File2 for every row of $File1. Build a hashtable up front, key -> data and/or key -> data[] (an array of values where one key maps to many rows), and then do index lookups instead of iterating through the arrays millions of times... this will be MUCH faster, assuming you have available RAM for the extra objects.
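A minimal sketch of the key -> data[] idea against the files in the question (PRACT_ID as the key; a List[string] per key covers the one-to-many relationship):
# Index File2 once: PRACT_ID -> list of matching lines
$facilityIndex = @{}
foreach ($line in $File2) {
    $key = $line.Split('|')[0]
    if (-not $facilityIndex.ContainsKey($key)) {
        $facilityIndex[$key] = [System.Collections.Generic.List[string]]::new()
    }
    $facilityIndex[$key].Add($line)
}

# In the main loop, replace the .Where() scan with a constant-time lookup;
# joining with a space mirrors what the original [string] cast produced
$userMatch = $facilityIndex[$name.PRACT_ID] -join ' '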
EDIT for @RoadRunner:
My experience with gc (Get-Content) may be old... it used to be horrendously slow on large files, but in newer PowerShell versions it appears to have been fixed:
[System.IO.File]::WriteAllLines("$($Env:UserProfile)\Desktop\10MB.txt", ('8' * 10MB))
[System.IO.File]::WriteAllLines("$($Env:UserProfile)\Desktop\50MB.txt", ('8' * 50MB))
[System.IO.File]::WriteAllLines("$($Env:UserProfile)\Desktop\100MB.txt", ('8' * 100MB))
[System.IO.File]::WriteAllLines("$($Env:UserProfile)\Desktop\500MB.txt", ('8' * 500MB))
$10MB  = gi "$($Env:UserProfile)\Desktop\10MB.txt"
$50MB  = gi "$($Env:UserProfile)\Desktop\50MB.txt"
$100MB = gi "$($Env:UserProfile)\Desktop\100MB.txt"
$500MB = gi "$($Env:UserProfile)\Desktop\500MB.txt"
0..10 | % {
    $n = [pscustomobject] @{
        'GC_10MB'     = (Measure-Command { Get-Content $10MB }).TotalSeconds
        'RAL_10MB'    = (Measure-Command { [System.IO.File]::ReadAllLines($10MB) }).TotalSeconds
        'GC_50MB'     = (Measure-Command { Get-Content $50MB }).TotalSeconds
        'RAL_50MB'    = (Measure-Command { [System.IO.File]::ReadAllLines($50MB) }).TotalSeconds
        'GC_100MB'    = (Measure-Command { Get-Content $100MB }).TotalSeconds
        'RAL_100MB'   = (Measure-Command { [System.IO.File]::ReadAllLines($100MB) }).TotalSeconds
        'GC_500MB'    = (Measure-Command { Get-Content $500MB }).TotalSeconds
        'RAL_500MB'   = (Measure-Command { [System.IO.File]::ReadAllLines($500MB) }).TotalSeconds
        'Delta_10MB'  = $null
        'Delta_50MB'  = $null
        'Delta_100MB' = $null
        'Delta_500MB' = $null
    }
    $n.Delta_10MB  = "{0:P}" -f ($n.GC_10MB / $n.RAL_10MB)
    $n.Delta_50MB  = "{0:P}" -f ($n.GC_50MB / $n.RAL_50MB)
    $n.Delta_100MB = "{0:P}" -f ($n.GC_100MB / $n.RAL_100MB)
    $n.Delta_500MB = "{0:P}" -f ($n.GC_500MB / $n.RAL_500MB)
    $n
}
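If Get-Content does turn out to be the slow part on your files, swapping it for ReadAllLines in the original script is a one-line change (a sketch):
# Read File2 with .NET instead of Get-Content, then skip the header row as before
$File2 = [System.IO.File]::ReadAllLines("C:\File2.csv") | Select-Object -Skip 1 | Sort-Object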
Load $File2 into a hashtable with the $_.Split('|')[0] value as the key - you can then also skip the object creation completely and offload everything to Select-Object:
$File2 = Get-Content "C:\File2.csv" | Select-Object -Skip 1 | Sort-Object
# load $file2 into hashtable
$userTable = @{}
foreach ($userEntry in $File2) {
    $userTable[$userEntry.Split('|')[0]] = $userEntry
}
# prepare the existing property names we want to preserve
$propertiesToSelect = 'ID_NUMBER', 'PRACT_ID', 'LAST_NAME', 'FIRST_NAME', 'MIDDLE_INITIAL', 'DEGREE', 'EMAILADDRESS', 'PRIMARY_CLINIC_PHONE', 'SPECIALTY_NAME', 'State_License', 'NPI_Number', 'University Affiliation', 'Teaching_Title'
# read file, filter on existence in $userTable, add the FACILITY calculated property before export
Import-csv "C:\File1.csv" -Delimiter '|' |Where-Object {$userTable.ContainsKey($_.PRACT_ID)} |Select-Object $propertiesToSelect,@{Name='FACILITY';Expression={$userTable[$_.PRACT_ID]}} |Export-Csv $logs -NoTypeInformation