user3195770

Reputation: 13

PowerShell script to remove a line of text from files in a folder

We have a program that creates email signatures and stores them in a deployment folder, from which they are copied to each user's local folder at login. However, when an employee is not assigned to an office, the comma separators for City/State still come along for the ride, as shown in this example:

[Image: example email signature]

The problem is that the program's source code cannot be found. Long term, I will rewrite it; short term, I need a PowerShell script that runs every night to remove the line containing the commas. I found the following solution here on Stack Overflow:

Get-ChildItem C:\temp\emailsigs -Filter *.htm | ForEach-Object {
    (Get-Content $_.FullName) |
        ForEach-Object { $_ -replace " ,   &nbsp; ,   &nbsp; <br />", "" } |
        Set-Content $_.FullName
}

This actually works pretty well. But I notice that every signature HTM file (over 1,100 of them) gets its timestamp updated, even when only two email signatures actually need the empty comma line removed. Is there a more efficient way to first check whether a file contains the offending commas, so the replacement runs only on those files and the majority are skipped?

Upvotes: 1

Views: 1275

Answers (2)

mklement0

Reputation: 438133

The following PSv5+ solution isn't memory-efficient, but it should speed up processing while avoiding rewriting files that don't need it:

Get-ChildItem C:\temp\emailsigs -Filter *.htm |
  ForEach-Object {
    $oldContent = Get-Content -Raw $_.FullName
    $newContent = $oldContent -replace ' ,   &nbsp; ,   &nbsp; <br />'
    if ($newContent.Length -lt $oldContent.Length) { # was a replacement performed?
      Set-Content $_.FullName -NoNewline -Value $newContent
    }
  }
• -Raw is PSv3+ and reads the entire file as a single string.

  • In PSv2, you could use [System.IO.File]::ReadAllText() instead, but note that it assumes UTF-8 as the encoding in the absence of a BOM, whereas Get-Content assumes "ANSI" encoding[1] (the system's legacy "ANSI" code page), so you may have to specify the encoding explicitly; see the sketch after this list.

• Processing each file as a single string speeds up processing (though each file must fit into memory twice). Taking advantage of the fact that -replace leaves an input string unmodified if the regex doesn't match, we can compare the length of the original contents to the length of the replaced contents to see whether something matched and the file therefore needs rewriting. Thus, only a single regex operation per file is needed.

  • Also note that ... -replace '...' (i.e., not specifying a replacement string) is equivalent to ... -replace '...', '', which effectively removes what was matched.

• -NoNewline requires PSv5+; it prevents an additional newline from getting appended on output.

  • In PSv4-, you could use [System.IO.File]::WriteAllText() instead, but note that its default encoding is UTF-8 without a BOM, whereas Set-Content, like Get-Content, defaults to "ANSI" encoding[1].

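For older PowerShell versions, here is a minimal sketch of how the fallbacks mentioned above could be combined. It assumes the system's legacy "ANSI" code page is the desired encoding; adjust $enc as needed:

# PSv2-compatible sketch: whole-file I/O via .NET, with an explicit encoding.
# [System.Text.Encoding]::Default is the system's legacy "ANSI" code page.
$enc = [System.Text.Encoding]::Default
Get-ChildItem C:\temp\emailsigs -Filter *.htm | ForEach-Object {
    $oldContent = [System.IO.File]::ReadAllText($_.FullName, $enc)
    # Explicit '' replacement string; equivalent to omitting it.
    $newContent = $oldContent -replace ' ,   &nbsp; ,   &nbsp; <br />', ''
    if ($newContent.Length -lt $oldContent.Length) { # was a replacement performed?
        # WriteAllText() does not append a trailing newline,
        # so no -NoNewline equivalent is needed.
        [System.IO.File]::WriteAllText($_.FullName, $newContent, $enc)
    }
}
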
[1] The above applies to Windows PowerShell. The cross-platform PowerShell Core edition defaults to (BOM-less) UTF-8 as well, like the .NET methods mentioned above.

Upvotes: 2

Esperento57

Reputation: 17472

Another method:

# -File (PSv3+) skips directories; read as UTF-8 to preserve diacritics.
Get-ChildItem C:\temp\emailsigs -File -Filter *.htm | ForEach-Object {
    $CurrentFile = $_
    $Content = Get-Content $CurrentFile.FullName -Encoding UTF8

    # Only rewrite the file if at least one line contains the offending commas.
    if ($Content -like '* ,   &nbsp; ,   &nbsp; <br />*')
    {
        # $Content is an array of lines; .Replace() runs on each line
        # via PSv3+ member enumeration.
        $Content.Replace(' ,   &nbsp; ,   &nbsp; <br />', '') |
            Set-Content $CurrentFile.FullName -Encoding UTF8
    }
}

I use UTF-8 to keep diacritics intact.
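
To illustrate why the explicit encoding matters, here is a minimal sketch; the file path is hypothetical, and the garbled output assumes Windows PowerShell on a system whose "ANSI" code page is Windows-1252:

# Write "Café" as BOM-less UTF-8 to a hypothetical test file.
$path = "$env:TEMP\diacritics-demo.txt"
$utf8NoBom = New-Object System.Text.UTF8Encoding $false
[System.IO.File]::WriteAllText($path, 'Café', $utf8NoBom)

Get-Content $path                  # default "ANSI" decoding: "CafÃ©" (garbled)
Get-Content $path -Encoding UTF8   # decoded correctly: "Café"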

Upvotes: 0
