Mike J

Reputation: 1240

How can I parse an XML file and delete text between two tags using PowerShell?

I have a file that has multiple instances of the following:

<password encrypted="True">271NFANCMnd8BFdERjHoAwEA7BTuX</password>

But for each instance the password is different.

I would like the output to have the encrypted password removed:

<password encrypted="True"></password>

What is the best method using PowerShell to loop through all instances of the pattern within the file and output to a new file?

Something like:

gc file1.txt | (regex here) > new_file.txt

where (regex here) is something like:

s/"True">.*<\/pass//

Upvotes: 1

Views: 2913

Answers (1)

briantist

Reputation: 47832

This one is fairly easy in regex, and you can do it that way, or you can parse it as actual XML, which may be more appropriate. I'll demonstrate both ways. In each case, we'll start with this common bit:

$raw = @"
<xml>
    <something>
        <password encrypted="True">hudhisd8sd9866786863rt</password>
    </something>
    <another>
        <thing>
            <password encrypted="True">nhhs77378hd8y3y8y282yr892</password>
        </thing>
    </another>
    <test>
        <password encrypted="False">plain password here</password>
    </test>
</xml>
"@

Regex

$raw -ireplace '(<password encrypted="True">)[^<]+(</password>)', '$1$2'

or:

$raw -ireplace '(?<=<password encrypted="True">).+?(?=</password>)', ''

XML

$xml = [xml]$raw

foreach($password in $xml.SelectNodes('//password')) {
    $password.InnerText = ''
}

Only replace the encrypted passwords:

$xml = [xml]$raw

foreach($password in $xml.SelectNodes('//password[@encrypted="True"]')) {
    $password.InnerText = ''
}

Explanations

Regex 1

(<password encrypted="True">)[^<]+(</password>)

The first regex method uses 2 capture groups to capture the opening and closing tags, and replaces the entire match with those tags (so the middle is omitted).

Regex 2

(?<=<password encrypted="True">).+?(?=</password>)

The second regex method uses a positive lookbehind and a positive lookahead. It matches one or more characters (as few as possible) that are preceded by the opening tag and followed by the closing tag. Since lookarounds are zero-width, the tags are not part of the match, so they don't get replaced.
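As a quick check that the two patterns behave the same way here, this is a minimal sketch run against a one-line sample. The `$sample` string is made up for illustration and is not from your file.

```powershell
# Hypothetical one-line sample: one encrypted password, one plain one.
$sample = '<password encrypted="True">abc</password><password encrypted="False">plain</password>'

# Regex 1: keep the two captured tags, drop what sits between them.
$one = $sample -ireplace '(<password encrypted="True">)[^<]+(</password>)', '$1$2'

# Regex 2: match only the text between the tags and replace it with nothing.
$two = $sample -ireplace '(?<=<password encrypted="True">).+?(?=</password>)', ''

$one -eq $two   # True
$one            # <password encrypted="True"></password><password encrypted="False">plain</password>
```

One difference to keep in mind: `.` does not match a newline by default, while `[^<]` does, so the two patterns can diverge if a password value ever spans lines.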

XML

Here we're using a simple XPath query to find all of the password nodes. We iterate through each one with a foreach loop and set its InnerText to an empty string.

The second version checks that the encrypted attribute is set to True and only operates on those.

Which to Choose

I personally think that the XML method is more appropriate, because it means you don't have to account for variations in XML syntax so much. You can also more easily account for different attributes specified on the nodes or different attribute values.

By using XPath, you have a lot more flexibility than with regex for processing XML.
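To make the point about syntax variations concrete, here is a small sketch. The `$variant` document is hypothetical: it writes the same element three ways (double quotes, single quotes, extra whitespace), all of which are valid XML.

```powershell
# Hypothetical sample: the same element written three slightly different ways.
$variant = @'
<xml>
    <password encrypted="True">abc123</password>
    <password encrypted='True'>def456</password>
    <password  encrypted="True" >ghi789</password>
</xml>
'@

# Regex 1 from above only matches the exact spelling of the opening tag.
$regexResult = $variant -ireplace '(<password encrypted="True">)[^<]+(</password>)', '$1$2'

# XPath works on the parsed attribute value, so quoting and spacing don't matter.
$doc = [xml]$variant
foreach ($node in $doc.SelectNodes('//password[@encrypted="True"]')) {
    $node.InnerText = ''
}

$regexResult    # def456 and ghi789 are still in the text
$doc.OuterXml   # all three password elements are now empty
```

You could loosen the regex to cope with these cases, but each new variation needs another tweak, whereas the XPath query stays the same.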

File operations

I noticed your sample to read the data used gc (short for Get-Content). Be aware that by default this reads the file line by line and returns an array of strings, one per line, rather than a single string.

You can use this to get your raw content in one string, for conversion to XML or processing by regex:

$raw = Get-Content file1.txt -Raw
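If you want to see the difference for yourself, here is a minimal sketch. It writes a throwaway three-line temp file (a made-up sample, not your data) and reads it back both ways.

```powershell
# Hypothetical temp file with three lines, created so the sketch runs on its own.
$tmp = [System.IO.Path]::GetTempFileName()
Set-Content -Path $tmp -Value '<xml>', '<password encrypted="True">abc</password>', '</xml>'

$lines = Get-Content $tmp        # default: one string per line
$text  = Get-Content $tmp -Raw   # -Raw: the whole file as a single string

$lines.GetType().Name   # Object[]
$lines.Count            # 3
$text.GetType().Name    # String

Remove-Item $tmp
```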

You can write it out pretty easily too:

$raw | Out-File new_file.txt
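If you go the XML route, you would save the document object rather than piping a string to Out-File. The sketch below puts the pieces together from reading to writing. The temp folder and the sample content are assumptions made up so it runs on its own; substitute your real paths.

```powershell
# Hypothetical working folder and sample input, created so the sketch is self-contained.
$dir = Join-Path ([System.IO.Path]::GetTempPath()) ([System.Guid]::NewGuid().ToString())
New-Item -ItemType Directory -Path $dir | Out-Null
$inPath  = Join-Path $dir 'file1.txt'
$outPath = Join-Path $dir 'new_file.txt'
Set-Content -Path $inPath -Value '<xml><password encrypted="True">secret123</password><password encrypted="False">plain</password></xml>'

# Read the whole file as one string and parse it as XML.
$xml = [xml](Get-Content $inPath -Raw)

# Blank out only the encrypted passwords.
foreach ($password in $xml.SelectNodes('//password[@encrypted="True"]')) {
    $password.InnerText = ''
}

# Save() is a .NET method, so pass it a full path; .NET's working directory
# is not necessarily the same as PowerShell's current location.
$xml.Save($outPath)

# Read the result back to confirm what was written.
$result = Get-Content $outPath -Raw
$result
```

Note that Save() may re-indent the document and write it in its own encoding, so the output is equivalent XML but not necessarily byte-for-byte the same layout as the input.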

Upvotes: 5