user12475155

Reputation: 15

PowerShell - Removing multiple lines of text between delimiters in a text file

I edit XML files and am using PowerShell to open them in Notepad and replace strings of text. Given two distinct delimiters, a starting and stopping, that appear multiple times in an XML file, I would like to completely remove the text between the delimiters (whether the delimiters get removed as well or not does not matter to me).

In the following example text, I want to completely remove the text between my starting and ending delimiter, but keep all the text before and after it.

The issue I am facing is that there are newlines at the end of each line of text, which prevent me from doing a simple:

-replace "<!--A6-->.*?<!--A6 end-->", "KEVIN"

Starting Delimiter:

<!--A6-->

Stopping Delimiter:

<!--A6 end-->

Example Text:

<listItem>
<para>Apple iPhone 6</para>
</listItem>
<listItem>
<para>Apple iPhone 8</para>
</listItem>
<!--A6-->
<listItem>
<para>Apple iPhone X</para>
</listItem>
<!--A6 end-->
</randomList></para>
</levelledPara>
<levelledPara>
<!--A6-->
<title>Available Apple iPhone Colors</title>
<para>The current iPhone model is available in
the following colors.  You can purchase this model
in store, or online.</para>
<!--A6 end-->
<para>If the color option that you want is out
of stock, you can find them at the following
website link.</para>

Current Code:

$Directory = "C:\Users\hellokevin\Desktop\PSTest"

$FindBook = "Book"

$ReplaceBook = "Novel"

$FindBike = "Bike"

$ReplaceBike = "Bicycle"

Get-ChildItem -Path $Directory -Recurse |
    Select-Object -Expand FullName|
        ForEach-Object {
            (Get-Content $_) -replace $FindBook,$ReplaceBook -replace "<!--A6-->.*?<!--A6 end-->", "KEVIN" |
            Set-Content ($_ + "_new.xml")
        }

Any help would be greatly appreciated. Being fairly new to PowerShell, I don't know how to factor in the newlines at the end of each line in my code. Thanks for looking!

Upvotes: 1

Views: 1316

Answers (2)

mklement0

Reputation: 439193

Note:

  • Generally, for robust processing, you should use a dedicated XML parser to parse XML text.

  • In the specific case at hand, using a regex is a convenient shortcut, with the caveat that it only works because the blocks of lines being removed are self-contained elements or element sequences; if this assumption doesn't hold, the modifications will invalidate the XML document.

    • Additionally, there may be character-encoding issues, because reading an XML file as text doesn't honor an explicit encoding attribute potentially present in the file's XML declaration - see the bottom section for details.

    • That said, the technique below is appropriate for modifying plain-text files that have no specific formal structure.


  • You need to use the s (SingleLine) regex option to ensure that . also matches newlines - such inline options are placed inside (?...) and typically go at the very start of the regex, so that they apply to the whole pattern; that is, '(?s)...' in this case.

    • Ad hoc, you can alternatively use the workaround [\s\S] instead of ., as suggested by x15; this expression matches any character that is either a whitespace character or a non-whitespace character, and therefore matches any character, including newlines.
  • To fully remove the lines of interest, you must also match the preceding and succeeding newline.

(Get-Content -Raw file.xml) -replace '(?s)\r?\n<!--A6-->.*?<!--A6 end-->\r?\n'
  • Get-Content -Raw file.xml reads the file into memory as a whole (single string).

    • Get-Content makes assumptions about a file's character encoding in the absence of a BOM: Windows PowerShell assumes ANSI encoding, and PowerShell [Core] v6+ now sensibly assumes UTF-8. Since Get-Content is a general-purpose text-file reading cmdlet, it is not aware of a potential encoding attribute in the XML declaration of XML input files (e.g.,
      <?xml version="1.0" encoding="ISO-8859-1"?>)
    • Similarly, Set-Content defaults to ANSI in Windows PowerShell, and to BOM-less UTF-8 in PowerShell [Core] v6+.
    • When in doubt, use the -Encoding parameter, both with Get-Content and with Set-Content.
    • See bottom section for more information.
  • \r?\n matches both Windows-style CRLF newlines and Unix-style LF-only ones.

  • Use (?:\r?\n)? instead of \r?\n if newlines aren't guaranteed to precede / succeed the lines of interest.
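Putting these pieces together, here is a minimal end-to-end sketch; the input file name input.xml and the _new.xml output suffix are placeholders, and -Encoding is stated explicitly only as a precaution against the default-encoding pitfalls discussed below - adjust it to your file's actual encoding:

```powershell
# Read the whole file as a single string, remove the delimited blocks
# (including the newlines around them), and write the result to a new file.
$path = Join-Path $pwd 'input.xml'   # placeholder file name
$text = Get-Content -Raw -Encoding UTF8 $path
$text = $text -replace '(?s)\r?\n<!--A6-->.*?<!--A6 end-->\r?\n'
Set-Content -Encoding UTF8 ($path -replace '\.xml$', '_new.xml') -Value $text
```

Because -Raw reads the file as one string, the (?s)-modified .*? can span line boundaries, which is exactly what the line-by-line approach in the question cannot do.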

To verify that the resulting string is still a well-formed XML document, simply cast it (or the command that produces it) to [xml], which throws an error if parsing fails: [xml] ((Get-Content ...) -replace ...)

If you find that the document is broken, use Tomalak's fully robust, but more complex XML-parsing answer.


XML files and character encodings:

If you use Get-Content to read an XML file as text, and that file has neither a UTF-8 BOM nor a UTF-16 / UTF-32 BOM, Get-Content makes an assumption: it assumes ANSI encoding (e.g., Windows-1252) in Windows PowerShell, and, more sensibly, UTF-8 encoding in PowerShell [Core] v6+. Since Get-Content is a general-purpose text-file reading cmdlet, it is not aware of a potential encoding attribute in the XML declaration of XML input files.

  • If you know the actual encoding, use the -Encoding parameter to specify it.

  • Use -Encoding with the same value when saving the file with Set-Content later: as is generally the case in PowerShell, once data has been loaded into memory by a file-reading cmdlet, no information about its original encoding is retained, and a file-writing cmdlet such as Set-Content later uses its fixed default encoding - which, again, is ANSI in Windows PowerShell and BOM-less UTF-8 in PowerShell [Core] v6+. Note that, unfortunately, different cmdlets have different defaults in Windows PowerShell, whereas PowerShell [Core] v6+ commendably and consistently defaults to UTF-8.
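As an illustration, a hypothetical round trip that preserves a known Latin-1 (ISO-8859-1) encoding; the file name is a placeholder:

```powershell
# Read a file whose actual encoding is known to be ISO-8859-1 (Latin-1),
# then save it back with that same encoding made explicit.
# Note: the 'Latin1' -Encoding value requires PowerShell (Core) 7.1+;
# Windows PowerShell's 'Default' (ANSI) may happen to coincide with it
# on Western-European systems, but that is not guaranteed.
$text = Get-Content -Raw -Encoding Latin1 "$pwd/some.xml"
# ... modify $text as needed ...
Set-Content -Encoding Latin1 "$pwd/some.xml" -Value $text
```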

The System.Xml.XmlDocument .NET type (whose PowerShell type accelerator is [xml]) offers robust XML parsing, and its .Load() and .Save() methods provide better encoding support if the document's XML declaration contains an explicit encoding attribute naming the encoding used:

  • If such an attribute is present (e.g., <?xml version="1.0" encoding="ISO-8859-1"?>), both .Load() and .Save() will honor it.

    • That is, an input file with an encoding attribute will be read correctly, and saved with that same encoding.
    • Of course, this assumes that the encoding named in the encoding attribute reflects the input file's actual encoding.
  • Otherwise, if the file has no BOM, (BOM-less) UTF-8 is assumed, as with PowerShell [Core] v6+'s Get-Content / Set-Content. That is sensible, because an XML document that has neither an encoding attribute nor a UTF-8 or UTF-16 BOM should default to UTF-8, per the W3C XML Recommendation. If the file does have a BOM, only UTF-8 and UTF-16 are permitted without also naming the encoding in an encoding attribute, although in practice XmlDocument also reads UTF-32 files with a BOM correctly.

    • This means that .Save() will not preserve the encoding of a (with-BOM) UTF-16 or UTF-32 file that doesn't have an encoding attribute, and will instead create a BOM-less UTF-8 file.

    • If you want to detect a file's actual encoding - as inferred either from its BOM (or the absence thereof) or, if present, from the encoding attribute - read your file via an XmlTextReader instance:

      # Create an XML reader.
      $xmlReader = [System.Xml.XmlTextReader]::new(
        "$pwd/some.xml" # IMPORTANT: use a FULL PATH
      )
      
      # Read past the declaration, which detects the encoding,
      # whether via the presence / absence of a BOM or an explicit
      # `encoding` attribute.
      $null = $xmlReader.MoveToContent()
      
      # Report the detected encoding.
      $xmlReader.Encoding
      
      # You can now pass the reader to .Load(), if needed
      # See next section for how to *save* with the detected encoding.
      $xmlDoc = [xml]::new()
      $xmlDoc.Load($xmlReader)
      $xmlReader.Close()
      
    • If a given file is non-compliant and you know the actual encoding used, and/or you want to save with a given encoding (be sure that it doesn't contradict the encoding attribute, if there is one), you can specify encodings explicitly (the equivalent of using -Encoding with Get-Content / Set-Content) by using the .Load() / .Save() method overloads that accept a Stream instance, via StreamReader / StreamWriter instances constructed with a given encoding; e.g.:

      # Get the encoding to use, matching the input file's.
      # E.g., if the input file is ISO-8859-1-encoded, but lacks
      # an `encoding` attribute in the XML declaration.
      $enc = [System.Text.Encoding]::GetEncoding('ISO-8859-1')
      
      # Create a System.Xml.XmlDocument instance.
      $xmlDoc = [xml]::new()
      # Create a stream reader for the input XML file
      # with explicit encoding.
      $streamIn = [System.IO.StreamReader]::new(
        "$pwd/some.xml", # IMPORTANT: use a FULL PATH
        $enc
      )
      # Read and parse the file.
      $xmlDoc.Load($streamIn)
      # Close the stream
      $streamIn.Close()
      
      # ... process the XML DOM.
      
      # Create a stream *writer* for saving back to the file
      # with the same encoding.
      $streamOut = [System.IO.StreamWriter]::new(
        "$pwd/some.xml", # IMPORTANT: use a FULL PATH
        $false, # don't append
        $enc    # same encoding as above in this case.
      )
      
      # Save the XML DOM to the file.
      $xmlDoc.Save($streamOut)
      # Close the stream
      $streamOut.Close()
      

A general caveat re passing file paths to .NET methods: Always use full paths, because .NET's idea of the current directory typically differs from PowerShell's.

Upvotes: 0

Tomalak

Reputation: 338316

Using search-and-replace on XML files is extremely inadvisable and should be avoided at all costs, because it's way too easy to damage the XML this way.

There are better ways of modifying XML, and they all follow this schema:

  • load the XML document
  • modify the document tree
  • write the XML document back to file.

For your case ("remove nodes between markers") this could be as follows:

  • load the XML document
  • look at all XML nodes, in document order
  • when we see a comment that reads "A6", set a flag to remove nodes from now on
  • when we see a comment that reads "A6 end", unset that flag
  • collect all nodes that should be removed (that come up while the flag is on)
  • in a last step, remove them
  • write the XML document back to file.

The following program would do exactly this (and also remove the "A6" comments themselves):

$doc = New-Object xml
$doc.Load("C:\path\to\your.xml")

$toRemove = @()
$A6flag = $false
foreach ($node in $doc.SelectNodes('//node()')) {
    if ($node.NodeType -eq "Comment") {
        if ($node.Value -eq 'A6') {
            $A6flag = $true
            $toRemove += $node
        } elseif ($node.Value -eq 'A6 end') {
            $A6flag = $false
            $toRemove += $node
        }
    } elseif ($A6flag) {
        $toRemove += $node
    }
}
foreach ($node in $toRemove) {
    [void]$node.ParentNode.RemoveChild($node)
}

$doc.Save("C:\path\to\your_modified.xml")

You could do string replacement inside the foreach loop as well:

if ($node.NodeType -eq "Text") {
    $node.Value = $node.Value -replace "Apple","APPLE"
}

Doing -replace on a single $node.Value is safe. Doing -replace on the entire XML is not.

Upvotes: 1
