Petras K

Reputation: 173

How to speed up a script without clogging RAM (directory with 2M+ files)

I have written this novice code which:

Runs through a master directory (2M+ files in many, many sub-folders), filters .tmx files (around 12k of them in the master dir), extracts specific strings, and saves them to .log files. After it is done, there are sub-procedures which clean up the log files and merge everything into one file. The problem is that I left the script running over-night and it just got stuck.

I reckon it's due to the directory size. I have previously listed all directory files to a .txt file; would it be possible for the script to read that .txt and process one file at a time? Maybe this way it won't take up 99% of the RAM after a while.
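A minimal sketch of that idea, assuming the previously created list (here `dir\filelist.txt`, a hypothetical name) contains one full path per line. Streaming the list means only one file's content is held in memory at a time, instead of making `Get-ChildItem` walk the whole 2M-file tree:

```powershell
# Stream the pre-built path list instead of recursing the master directory.
# 'dir\filelist.txt' is a placeholder name; adjust to the actual list file.
Get-Content 'dir\filelist.txt' |
    Where-Object { $_ -like '*.tmx' } |
    ForEach-Object {
        $file = Get-Item $_
        # Process one file; $content is replaced on the next iteration,
        # so the previous file's lines become eligible for garbage collection.
        $content = Get-Content $file.FullName
        $content |
            Where-Object { $_ -match '(?:creationid|changeid)="([^"]+)"' -or
                           $_ -match '(<tuv.+?lang="[A-Za-z\-]+">)' } |
            ForEach-Object { $matches[1] } | Get-Unique |
            Set-Content ($file.BaseName + '_out.log')
    }
```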

Maybe you also have other insights on speeding it up or merging these procedures into one.

Get-ChildItem "MasterDirpath\*.tmx" -Recurse |
Foreach-Object {
$content = Get-Content $_.FullName

#filter and save content to the original file
#$content | Where-Object {$_ -match '<tu '} | Set-Content $_.FullName

#filter matching lines and save the captured values to a new file
$content | Where-Object {$_ -match '(?:creationid|changeid)="([^"]+)"' -or
$_ -match '(<tuv.+?lang="[A-Za-z\-]+">)'} | %{$matches[1]} | Get-Unique |
Set-Content ($_.BaseName + '_out.log')

}

Get-ChildItem "dir\tologs" -Filter *.log |
Foreach-Object {
$content = Get-Content -Raw $_.FullName
#make one line from extracted matches
$content -replace "`r`n<", "`t<" | Set-Content $_.FullName

}



Get-ChildItem "dir\tologs"  -Filter *.log | 
Foreach-Object {
$content = Get-Content $_.FullName

#filter and save content to the original file log file
$content | Where-Object {$_ -match '^.+$'} | Sort | Get-Unique | Set-Content $_.FullName
}


$path = "dir\tologs"
$out  = "dir\tologs\output.txt"

Get-ChildItem $path -Filter *.log | % {
$file = $_.Name
Get-Content $_.FullName | % {
    "${file}: $_" | Out-File -Append $out
}
}

UPDATE: Sample input

These .tmx files vary in size from 1 MB to 2 GB; the directory as a whole is around 1 TB, and the files in it range from a few MB up to a few GB. The script runs fine on a small directory with 50 .tmx files of 1-100 MB.

<?xml version="1.0" encoding="utf-16"?>
<!DOCTYPE tmx SYSTEM "tmx14.dtd">
<tmx version="1.4">
<header creationtool="MemoQ" creationtoolversion="7.0.68" segtype="sentence" 
adminlang="en-us" creationid="lsmall" srclang="en-us" o-tmf="MemoQTM" 
datatype="unknown">
<prop type="defclient"> </prop>
<prop type="defproject"> </prop>
<prop type="defdomain"> </prop>
<prop type="defsubject"> </prop>
<prop type="description"> </prop>
<prop type="targetlang">hu</prop>
<prop type="name">28807_Project_HU</prop>
</header>
<body>
<tu changedate="20151104T174128Z" creationdate="20150929T180844Z" creationid="pmccrory" changeid="lsmall">
<prop type="client"> </prop>
<prop type="project"> </prop>
<prop type="domain"> </prop>
<prop type="subject"> </prop>
<prop type="corrected">no</prop>
<prop type="aligned">no</prop>
<prop type="x-document">node_data_en_final.xml</prop>
<tuv xml:lang="en-us">
  <prop type="x-context-pre">&lt;seg&gt;Biomarkers and Integrated Solutions&lt;/seg&gt;</prop>
  <prop type="x-context-post">&lt;seg&gt;Novel therapeutic agents that have fast onset of action, good safety and tolerability profiles and that address common co-morbidities (for example, anxiety and substance abuse) &lt;ph type='fmt'&gt;{}&lt;/ph&gt;&lt;it pos='begin'&gt;&amp;lt;ul&amp;gt;&lt;/it&gt;&lt;/seg&gt;</prop>
  <seg><it pos='begin'>&lt;ul class=&quot;inline&quot;&gt;</it></seg>
</tuv>
<tuv xml:lang="hu">
  <seg><it pos='begin'>&lt;ul class=&quot;inline&quot;&gt;</it></seg>
</tuv>
</tu>
</body>
</tmx>

Output after procedure:

AABB-COR-09_Master_DE_out.log: 6293 SYB <tuv xml:lang="en-us">    <tuv xml:lang="de-de">
ABB-COR-09_Master_DE_out.log: AD    <tuv xml:lang="en-us">    <tuv xml:lang="de-de">
ABB-COR-09_Master_DE_out.log: AGENTILE  <tuv xml:lang="en-us">    <tuv xml:lang="de-de">
ABB-COR-09_Master_DE_out.log: ALIGN!    <tuv xml:lang="en-us">    <tuv xml:lang="de-de">
ABB-COR-09_Master_DE_out.log: ANGELIKA  <tuv xml:lang="en-us">    <tuv xml:lang="de-de">
ABB-COR-09_Master_DE_out.log: ASEDR <tuv xml:lang="en-us">    <tuv xml:lang="de-de">

Upvotes: 0

Views: 75

Answers (2)

woxxom

Reputation: 73616

  • Use one pipeline for the entire processing instead of 4 separate passes.
  • Use string operators like -join and -split instead of writing to and reading back from the same file.
  • Use the [regex] class and its Matches method to extract all the tokens you want.

$RX_EXTRACT = [regex](
    '(?<=(creationid|changeid)=")[^"]+(?=")|' +
    '<tuv.+?lang="[A-Za-z\-]+">'
) # the unwanted parts are suppressed from the output via look-behind and look-ahead

Get-ChildItem (Join-Path $TMX_DIR *.tmx) -Recurse | ForEach {
    $_.FullName + ': ' + (
        ($RX_EXTRACT.Matches((Get-Content $_ -raw)).Value | Get-Unique
        ) -join "`n" -replace '\n<', "`t<" -split "`n" -ne '' | Sort -Unique
    ) -join "`t"
} | Out-File "dir\tologs\output.txt"

Not tested extensively. Use it as an example.

Upvotes: 1

vonPryz

Reputation: 24071

As a first debugging step, I would implement a progress indicator. Though there's Write-Progress for nice output, a simple version that prints dots will often do just fine. By watching the dots, you can see whether the script has stopped or is still running (albeit slowly).

Start by saving the files in a variable instead of passing them directly into the pipeline. Then you can easily log the number of files. When actually processing the files, print a dot . for every, say, ten thousand files. The actual denominator is up to you; 10000 is a good starting guess, as 2M/10k = 200, so there won't be too many dots to read.

$tmxFiles = Get-ChildItem "MasterDirpath\*.tmx" -Recurse
write-host "Processing" $tmxFiles.Count ".tmx files"
$i=0;

$tmxFiles | % {
    if (++$i % 10000 -eq 0) {
        write-host -nonewline "."
    }
    # actual processing happens next
    ...
}

When advancing to the next step, use the same logic:

$toLogs = Get-ChildItem "dir\tologs" -Filter *.log
write-host "Processing" $toLogs.Count ".log files"
$i=0;

$toLogs | % {
    if (++$i % 10000 -eq 0) {
        write-host -nonewline "."
    }
    # actual processing
    ...
}

There's also Measure-Command, which can be used to measure how long a script block runs. Use it once you have figured out which parts of the process are the more expensive ones.
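For instance, a minimal sketch that times one full pass over the log files (`dir\tologs` is the path from the question; `Out-Null` here is a stand-in for the real processing):

```powershell
# Measure-Command returns a TimeSpan for the enclosed script block.
$elapsed = Measure-Command {
    Get-ChildItem "dir\tologs" -Filter *.log | ForEach-Object {
        Get-Content $_.FullName | Out-Null   # stand-in for the real work
    }
}
Write-Host ("Pass took {0:N1} seconds" -f $elapsed.TotalSeconds)
```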

Upvotes: 1
