Marc Kean

Reputation: 533

Match particular CDATA sections in XML data

I am trying to write a PowerShell regex. From the data further below, I want to capture two values and assign each one to a variable, so I need two regexes. The two values I need to match exactly are King and Years & Years. Please note that these two values change (which is why I need to capture them); the rest of the data stays the same.

This is the regex I have at the moment, but it's not working for me.

\s+artist\s*>\s*<\s*!\s*[CDATA\s*[(.*)\s*]\s*]\s*>\s*<\s*/artist

And here is the page (or data) I am trying to use regex with.

<on_air>
  <publishedInfo publishedDate="2015-07-18 16:24:28" />
  <stationName><![CDATA[Mix 106.5]]></stationName>
  <stationPrefix><![CDATA[mix1065]]></stationPrefix>
  <generic_coverart><![CDATA[http://media.arn.com.au/images/getImage.aspx?i=generic_mix1065.jpg]]></generic_coverart>
  <now_playing>
    <audio ID="id_1705168034_30458146" type="song">
      <title generic="False"><![CDATA[King*]]></title>
      <artist><![CDATA[Years & Years]]></artist>
      <number><![CDATA[46029]]></number>
      <cut><![CDATA[1]]></cut>
      <ref><![CDATA[]]></ref>
      <played_datetime><![CDATA[2015-07-18 16:24:27]]></played_datetime>
      <length><![CDATA[00:03:28]]></length>
      <coverart generic="true"><![CDATA[http://media.arn.com.au/images/getImage.aspx?i=generic_mix1065.jpg]]></coverart>
      <options>
        <option><![CDATA[KIIS S Integrated]]></option>
      </options>
    </audio>
  </now_playing>
</on_air>

Upvotes: 2

Views: 373

Answers (3)

Luv2code

Reputation: 1109

You need to escape the literal brackets, because an unescaped [ opens a character class.

Also, it's a good practice to avoid using the dot "match almost any character" metacharacter when your intentions are more specific. In your case, what you really want to do is match until you hit the closing bracket, so it's safer to specify that:

'\s+artist\s*>\s*<\s*!\s*\[CDATA\s*\[([^]]*)\s*\]\s*\]\s*>\s*<\s*\/artist'

Note: Regex syntax is contextual. I don't have to escape the closing bracket inside the character class because of its position: it is the first character in the negated class, and in that position it cannot be the bracket that closes the class. In other words, it's not ambiguous.
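As a rough illustration of the escaping advice above, here is a minimal PowerShell sketch that runs an escaped pattern against a fragment of the question's data and stores the capture in a variable. The variable names ($text, $pattern, $artist) are made up for this example. The sketch anchors on <\s*artist as an assumption, since in the sample data the character directly before artist is < rather than whitespace.

```powershell
# Fragment copied from the sample data in the question.
$text = @'
      <title generic="False"><![CDATA[King*]]></title>
      <artist><![CDATA[Years & Years]]></artist>
'@

# Literal [ and ] are escaped with a backslash.
# [^]]* captures every character up to the first closing bracket.
# Assumption: the pattern starts at the < that opens the artist tag.
$pattern = '<\s*artist\s*>\s*<\s*!\s*\[CDATA\s*\[([^]]*)\]\s*\]\s*>\s*<\s*/artist'

$artist = $null
if ($text -match $pattern) {
    # $Matches is filled in by -match; index 1 is the first capture group.
    $artist = $Matches[1]
}

$artist
```

Swapping artist for title in the pattern would not be enough on its own, because the title tag in the sample carries a generic attribute that the pattern would also have to allow for.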

Upvotes: 1

user4003407

Reputation: 22132

If it is valid XML, then you do not need regular expressions. PowerShell adapts XML objects, so you can use standard property syntax to navigate them:

$xml=[xml]@'
<on_air>
  <publishedInfo publishedDate="2015-07-18 16:24:28" />
  <stationName><![CDATA[Mix 106.5]]></stationName>
  <stationPrefix><![CDATA[mix1065]]></stationPrefix>
  <generic_coverart><![CDATA[http://media.arn.com.au/images/getImage.aspx?i=generic_mix1065.jpg]]></generic_coverart>
  <now_playing>
    <audio ID="id_1705168034_30458146" type="song">
      <title generic="False"><![CDATA[King*]]></title>
      <artist><![CDATA[Years & Years]]></artist>
      <number><![CDATA[46029]]></number>
      <cut><![CDATA[1]]></cut>
      <ref><![CDATA[]]></ref>
      <played_datetime><![CDATA[2015-07-18 16:24:27]]></played_datetime>
      <length><![CDATA[00:03:28]]></length>
      <coverart generic="true"><![CDATA[http://media.arn.com.au/images/getImage.aspx?i=generic_mix1065.jpg]]></coverart>
      <options>
        <option><![CDATA[KIIS S Integrated]]></option>
      </options>
    </audio>
  </now_playing>
</on_air>
'@
$xml.on_air.now_playing.audio.title.'#cdata-section'
$xml.on_air.now_playing.audio.artist.'#cdata-section'
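Since the question asks for the two values in variables, a possible variation on the same idea is to select the nodes with XPath and read InnerText, which returns the text content of an element including CDATA. This is only a sketch on a trimmed copy of the data; the variable names ($doc, $audio, $title, $artist) are chosen for the example.

```powershell
# Trimmed copy of the sample data, cast to an XmlDocument.
[xml]$doc = @'
<on_air>
  <now_playing>
    <audio type="song">
      <title generic="False"><![CDATA[King*]]></title>
      <artist><![CDATA[Years & Years]]></artist>
    </audio>
  </now_playing>
</on_air>
'@

# SelectSingleNode takes an XPath expression and returns the first match.
$audio = $doc.SelectSingleNode('/on_air/now_playing/audio')

# InnerText yields the CDATA content without the <![CDATA[ ]]> wrapper.
$title  = $audio.SelectSingleNode('title').InnerText
$artist = $audio.SelectSingleNode('artist').InnerText

$title
$artist
```

Note that the title in the sample data is King* with a trailing asterisk, so that is what ends up in $title.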

Upvotes: 4

MBaas

Reputation: 7530

To help you get off the ground, here is a suggestion for Years & Years (insert a whitespace selector such as \s* wherever possible):

artist><!\[CDATA\[Years & Years\]\]></artist
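One way to take this literal pattern further is to replace the fixed text with a capture group and let the tag name vary. The sketch below does that with a hypothetical helper, Get-CDataValue, which is not part of PowerShell; its name and parameters are invented for this example. It uses a lazy (.*?) up to the ]]> terminator, on the assumption that the CDATA content itself might contain a single ].

```powershell
# Hypothetical helper: returns the CDATA content of the first <Tag> element in Text.
function Get-CDataValue {
    param(
        [Parameter(Mandatory)] [string] $Text,
        [Parameter(Mandatory)] [string] $Tag
    )

    # Escape the tag name so it is always treated as literal text.
    $name = [regex]::Escape($Tag)

    # \s* allows optional whitespace; [^>]* skips attributes such as generic="False".
    # (.*?) lazily captures everything up to the first ]]> terminator.
    $pattern = "<\s*${name}\b[^>]*>\s*<!\[CDATA\[(.*?)\]\]>\s*<\s*/\s*${name}\s*>"

    # Singleline lets the dot match newlines in case the content spans lines.
    $options = [System.Text.RegularExpressions.RegexOptions]::Singleline
    $m = [regex]::Match($Text, $pattern, $options)

    if ($m.Success) { $m.Groups[1].Value }
}

# Fragment copied from the sample data in the question.
$text = @'
      <title generic="False"><![CDATA[King*]]></title>
      <artist><![CDATA[Years & Years]]></artist>
'@

$title  = Get-CDataValue -Text $text -Tag 'title'
$artist = Get-CDataValue -Text $text -Tag 'artist'
```

The function returns nothing when the tag is not found, so $title or $artist would be $null in that case.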

Upvotes: 0
