ladders81

Reputation: 51

Use PowerShell to replace characters within a specific string

I'm using a PowerShell script to automate the replacement of some troublesome characters in an XML file, such as & ' - £

The script I have works well for these characters, but I also want to remove the double quote character " when it appears inside an XML attribute value. Because attribute values are themselves enclosed in double quotes, I can't simply remove every double quote from the file, as that would break the attributes.

My PowerShell script is below:

(Get-Content C:\test\communication.xml) | 
Foreach-Object {$_ -replace "&", "+" -replace "£", "GBP" -replace "'", "" -replace "–", " "} |
Set-Content C:\test\communication.xml

What I'd like to do is remove ONLY the double quotes that appear inside an XML attribute value, leaving the pair of double quotes that encloses the value, as in the example below. I know that PowerShell treats each line as a separate object, so I suspect this should be quite easy, possibly by using conditions?

An example XML file is below:

<?xml version="1.0" encoding="UTF-8"?>
<Portal> 
<communication updates="Text data with no double quotes in the attribute" />
<communication updates="Text data that "includes" double quotes within the double quotes for the attribute" />
</Portal>

In the above example I'd like to remove only the double quotes that immediately surround the word includes, but NOT the double quote to the left of the word Text or the one to the right of the word attribute. The text of the XML attributes will change regularly, but the opening double quote will always be immediately to the right of the = symbol, and the closing double quote will always be immediately to the left of a space and forward slash ( /). Thanks

Upvotes: 0

Views: 11624

Answers (1)

Nick

Reputation: 4362

Try this regex:

"(?<!\?xml.*)(?<=`".*?)`"(?=.*?`")"

In your code it would be:

(Get-Content C:\test\communication.xml) | 
Foreach-Object {$_ -replace "&", "+" `
    -replace "£", "GBP" `
    -replace "'", "" `
    -replace "–", " " `
    -replace "(?<!\?xml.*)(?<=`".*?)`"(?=.*?`")", ""} |
Set-Content C:\test\communication.xml

This matches any " that has another " both in front of it and behind it on the same line (except on a line where ?xml appears earlier) and replaces it with nothing.
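As a quick check of that behaviour, here is a minimal sketch that runs the pattern against individual lines instead of the file. The pattern is written in single quotes so the backtick escapes are not needed. The variable names and the two-attribute line are hypothetical, added only for illustration; they are not from the original file.

```powershell
# Same pattern as above, single-quoted so " needs no backtick escape
$pattern = '(?<!\?xml.*)(?<=".*?)"(?=.*?")'

# Two lines modelled on the sample file
$declaration = '<?xml version="1.0" encoding="UTF-8"?>'
$nested      = '<communication updates="Text data that "includes" double quotes" />'

# Hypothetical line with two attributes (an assumption, not in the sample file)
$twoAttrs    = '<communication updates="first" other="second" />'

$results = $declaration, $nested, $twoAttrs |
    ForEach-Object { $_ -replace $pattern, '' }

$results
```

The first line should come back unchanged and the second should lose only the quotes around includes. The third line shows a limitation: because every quote that has a quote on both sides is removed, it becomes `<communication updates="first other=second" />`. The pattern therefore relies on each line carrying a single attribute, as in the sample file.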

Edit to include a breakdown of the regex:

(?<!\?xml.*)(?<=`".*?)`"(?=.*?`")

1. (?<!\?xml.*)----> Negative lookbehind: skips any quote that has "?xml" earlier on the same line
2. (?<=`".*?)------> Lookbehind: requires a quotation mark somewhere earlier on the line.
       The ` escapes the quotation mark, which PowerShell needs inside a double-quoted string
3. `"--------------> The actual quotation mark you are searching for
4. (?=.*?`")-------> Lookahead: requires a quotation mark somewhere later on the line
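If it helps readability, the same four parts can be kept as comments inside the pattern itself. This is only a sketch: it assumes the .NET inline option (?x), which ignores whitespace in the pattern and treats # as the start of a comment, and it uses a single-quoted here-string so no backticks are needed. The variable name $pattern is made up for the example.

```powershell
# Single-quoted here-string: the closing '@ must start its own line
$pattern = @'
(?x)              # ignore pattern whitespace, allow comments
(?<! \?xml .* )   # 1. not preceded by "?xml" anywhere earlier on the line
(?<= " .*? )      # 2. a quotation mark somewhere earlier on the line
"                 # 3. the quotation mark to remove
(?= .*? " )       # 4. a quotation mark somewhere later on the line
'@

'<communication updates="a "b" c" />' -replace $pattern, ''
```

This version can be passed to -replace in the script in place of the one-line pattern.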

For more information about lookbehinds and lookaheads see this site

Upvotes: 1
