Petras K

Reputation: 173

How to speed up a script without clogging RAM (directory with 2M+ files)

I have written this novice code which:

Runs through a master directory (2M+ files in many, many sub-folders), filters .tmx files (around 12k of them in the master dir), extracts specific strings, and saves them to .log files. After it is done, there are sub-procedures which clean up the log files and merge everything into one file. The problem is that I left the script running over-night and it just got stuck.

I reckon it's due to the directory size. I have previously listed all directory files to a .txt file; would it be possible for the script to read that .txt and process one file at a time? Maybe this way it won't take up 99% of the RAM after a while.
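A minimal sketch of that idea, assuming the previously created list (here `dir\filelist.txt`, a hypothetical name) contains one full path per line. Streaming the list means only one file's content is held in memory at a time, instead of making `Get-ChildItem` walk the whole 2M-file tree:

```powershell
# Stream the pre-built path list instead of recursing the master directory.
# 'dir\filelist.txt' is a placeholder name; adjust to the actual list file.
Get-Content 'dir\filelist.txt' |
    Where-Object { $_ -like '*.tmx' } |
    ForEach-Object {
        $file = Get-Item $_
        # Process one file; $content is replaced on the next iteration,
        # so the previous file's lines become eligible for garbage collection.
        $content = Get-Content $file.FullName
        $content |
            Where-Object { $_ -match '(?:creationid|changeid)="([^"]+)"' -or
                           $_ -match '(<tuv.+?lang="[A-Za-z\-]+">)' } |
            ForEach-Object { $matches[1] } | Get-Unique |
            Set-Content ($file.BaseName + '_out.log')
    }
```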

Maybe you also have other insights on speeding it up or merging these procedures into one.

Get-ChildItem "MasterDirpath\*.tmx" -Recurse |
Foreach-Object {
$content = Get-Content $_.FullName

#filter and save content to the original file
#$content | Where-Object {$_ -match '<tu '} | Set-Content $_.FullName

#filter matching lines and save the captured values to a new file
$content | Where-Object {$_ -match '(?:creationid|changeid)="([^"]+)"' -or
$_ -match '(<tuv.+?lang="[A-Za-z\-]+">)'} | %{$matches[1]} | Get-Unique |
Set-Content ($_.BaseName + '_out.log')

}

Get-ChildItem "dir\tologs" -Filter *.log |
Foreach-Object {
$content = Get-Content -Raw $_.FullName
#make one line from extracted matches
$content -replace "`r`n<", "`t<" | Set-Content $_.FullName

}



Get-ChildItem "dir\tologs"  -Filter *.log | 
Foreach-Object {
$content = Get-Content $_.FullName

#filter and save content to the original file log file
$content | Where-Object {$_ -match '^.+$'} | Sort | Get-Unique | Set-Content $_.FullName
}


$path = "dir\tologs"
$out  = "dir\tologs\output.txt"

Get-ChildItem $path -Filter *.log | % {
$file = $_.Name
Get-Content $_.FullName | % {
    "${file}: $_" | Out-File -Append $out
}
}

UPDATE: Sample input

These .tmx files vary in size from 1 MB to 2 GB; the directory as a whole is around 1 TB, and the files in it range from a few MB up to a few GB. The script runs fine on a small directory with 50 .tmx files of 1-100 MB.

<?xml version="1.0" encoding="utf-16"?>
<!DOCTYPE tmx SYSTEM "tmx14.dtd">
<tmx version="1.4">
<header creationtool="MemoQ" creationtoolversion="7.0.68" segtype="sentence" 
adminlang="en-us" creationid="lsmall" srclang="en-us" o-tmf="MemoQTM" 
datatype="unknown">
<prop type="defclient"> </prop>
<prop type="defproject"> </prop>
<prop type="defdomain"> </prop>
<prop type="defsubject"> </prop>
<prop type="description"> </prop>
<prop type="targetlang">hu</prop>
<prop type="name">28807_Project_HU</prop>
</header>
<body>
<tu changedate="20151104T174128Z" creationdate="20150929T180844Z" creationid="pmccrory" changeid="lsmall">
<prop type="client"> </prop>
<prop type="project"> </prop>
<prop type="domain"> </prop>
<prop type="subject"> </prop>
<prop type="corrected">no</prop>
<prop type="aligned">no</prop>
<prop type="x-document">node_data_en_final.xml</prop>
<tuv xml:lang="en-us">
  <prop type="x-context-pre">&lt;seg&gt;Biomarkers and Integrated Solutions&lt;/seg&gt;</prop>
  <prop type="x-context-post">&lt;seg&gt;Novel therapeutic agents that have fast onset of action, good safety and tolerability profiles and that address common co-morbidities (for example, anxiety and substance abuse) &lt;ph type='fmt'&gt;{}&lt;/ph&gt;&lt;it pos='begin'&gt;&amp;lt;ul&amp;gt;&lt;/it&gt;&lt;/seg&gt;</prop>
  <seg><it pos='begin'>&lt;ul class=&quot;inline&quot;&gt;</it></seg>
</tuv>
<tuv xml:lang="hu">
  <seg><it pos='begin'>&lt;ul class=&quot;inline&quot;&gt;</it></seg>
</tuv>
</tu>
</body>
</tmx>

Output after procedure:

AABB-COR-09_Master_DE_out.log: 6293 SYB <tuv xml:lang="en-us">    <tuv xml:lang="de-de">
ABB-COR-09_Master_DE_out.log: AD    <tuv xml:lang="en-us">    <tuv xml:lang="de-de">
ABB-COR-09_Master_DE_out.log: AGENTILE  <tuv xml:lang="en-us">    <tuv xml:lang="de-de">
ABB-COR-09_Master_DE_out.log: ALIGN!    <tuv xml:lang="en-us">    <tuv xml:lang="de-de">
ABB-COR-09_Master_DE_out.log: ANGELIKA  <tuv xml:lang="en-us">    <tuv xml:lang="de-de">
ABB-COR-09_Master_DE_out.log: ASEDR <tuv xml:lang="en-us">    <tuv xml:lang="de-de">

Upvotes: 0

Views: 75

Answers (2)

woxxom

Reputation: 73616

  • Use one pipeline for the entire processing instead of 4 separate passes.
  • Use string operators like -join and -split instead of writing to and reading back from the same file.
  • Use the [regex] class and its Matches method to extract all the tokens you want.

$RX_EXTRACT = [regex](
    '(?<=(creationid|changeid)=")[^"]+(?=")|' +
    '<tuv.+?lang="[A-Za-z\-]+">'
) # the unwanted parts are suppressed from the output via look-behind and look-ahead

Get-ChildItem (Join-Path $TMX_DIR *.tmx) -Recurse | ForEach {
    $_.FullName + ': ' + (
        ($RX_EXTRACT.Matches((Get-Content $_ -raw)).Value | Get-Unique
        ) -join "`n" -replace '\n<', "`t<" -split "`n" -ne '' | Sort -Unique
    ) -join "`t"
} | Out-File "dir\tologs\output.txt"

Not tested extensively. Use it as an example.

Upvotes: 1

vonPryz

Reputation: 24071

As a first debugging step, I would implement a progress indicator. Though there's Write-Progress for nice output, a simple version that prints dots will often do just fine. By watching the dots, you can see whether the script has stopped or is still running (albeit slowly).

Start by saving the files in a variable instead of passing them directly into the pipeline. Then you can easily log the number of files. When actually processing the files, print a dot . for every, say, ten thousand files. The actual denominator is up to you; 10000 is a good starting guess, as 2M/10k = 200, so there won't be too many dots to read.

$tmxFiles = Get-ChildItem "MasterDirpath\*.tmx" -Recurse
write-host "Processing" $tmxFiles.Count ".tmx files"
$i=0;

$tmxFiles | % {
    if (++$i % 10000 -eq 0) {
        write-host -nonewline "."
    }
    # actual processing happens next
    ...
}

When advancing to the next step, use the same logic:

$toLogs = Get-ChildItem "dir\tologs" -Filter *.log
write-host "Processing" $toLogs.Count ".log files"
$i=0;

$toLogs | % {
    if (++$i % 10000 -eq 0) {
        write-host -nonewline "."
    }
    # actual processing
    ...
}

There's also Measure-Command, which can be used to measure how long a script block runs. Use it once you have figured out which parts of the process are the more expensive ones.
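For instance, a minimal sketch that times one full pass over the log files (`dir\tologs` is the path from the question; `Out-Null` here is a stand-in for the real processing):

```powershell
# Measure-Command returns a TimeSpan for the enclosed script block.
$elapsed = Measure-Command {
    Get-ChildItem "dir\tologs" -Filter *.log | ForEach-Object {
        Get-Content $_.FullName | Out-Null   # stand-in for the real work
    }
}
Write-Host ("Pass took {0:N1} seconds" -f $elapsed.TotalSeconds)
```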

Upvotes: 1
