Reputation: 12354
Given the following string:
'<p><a href="china">China</a><br><a href="india">India</a><br><a
href="korea">Korea</a><br><a href="malaysia">Malaysia</a><br><a
href="thailand">Thailand</a></p>'
I'd like to use PowerShell to extract all of the countries listed therein. In other words, I want to return @(China,India,Korea,Malaysia,Thailand).
I have tried using regex but can't find the right pattern. For example:
'<p><a href="china">China</a><br><a href="india">India</a><br><a href="korea">Korea</a><br><a href="malaysia">Malaysia</a><br><a href="thailand">Thailand</a></p>' -match '(<a href="[A-Z a-z]*">[A-Z a-z]*</a>)+'
$matches
Which returns:
Name Value
---- -----
1 <a href="china">China</a>
0 <a href="china">China</a>
Any suggestions? Is regex the right approach here?
P.S. Note that the snippet is not well-formed so I can't simply convert it to XML.
Upvotes: 0
Views: 2394
Reputation: 11
$InputString='<p><a href="china">China</a><br><a href="india">India</a><br><a href="korea">Korea</a><br><a href="malaysia">Malaysia</a><br><a href="thailand">Thailand</a></p>'
$Pattern='(?<=>)\w+?(?=<)'
([Regex]::Matches($InputString,$Pattern)).Value
China
India
Korea
Malaysia
Thailand
Upvotes: 0
Reputation: 22132
The $Matches automatic variable holds the capture groups of the last -match operation, not a list of every match. -match stops at the first match, so to get all matches of a pattern you have to use the Matches method of the [Regex] class:
$InputString='<p><a href="china">China</a><br><a href="india">India</a><br><a href="korea">Korea</a><br><a href="malaysia">Malaysia</a><br><a href="thailand">Thailand</a></p>'
$Pattern='<a href="[A-Z a-z]*">([A-Z a-z]*)</a>'
$Countries=[Regex]::Matches($InputString,$Pattern)|ForEach-Object {$_.Groups[1].Value}
$Countries
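If you would rather stay with cmdlets, a similar result can be sketched with Select-String and its -AllMatches switch. This is only an illustration that reuses the same pattern and the same capture group; note that Select-String is case-insensitive by default.

```powershell
# Same input and pattern as above.
$InputString = '<p><a href="china">China</a><br><a href="india">India</a><br><a href="korea">Korea</a><br><a href="malaysia">Malaysia</a><br><a href="thailand">Thailand</a></p>'
$Pattern = '<a href="[A-Z a-z]*">([A-Z a-z]*)</a>'

# -AllMatches makes Select-String report every match on the line, not only the first.
# Each MatchInfo exposes them through its Matches property; group 1 is the link text.
$Countries = $InputString |
    Select-String -Pattern $Pattern -AllMatches |
    ForEach-Object { $_.Matches } |
    ForEach-Object { $_.Groups[1].Value }

$Countries
```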
For parsing HTML, though, you are better off using an HTML parser, as the other answer suggests.
Upvotes: 3
Reputation: 95742
Regular expressions are never a good way to handle HTML (though often they are tempting). You can parse the HTML and extract the data you want without using any regex:
PS C:\> $d = '<p><a href="china">China</a><br><a href="india">India</a><br><a
href="korea">Korea</a><br><a href="malaysia">Malaysia</a><br><a
href="thailand">Thailand</a></p>'
PS C:\> $html = New-Object -ComObject "HTMLFile"
PS C:\> $html.IHTMLDocument2_write($d)
PS C:\> $html.getElementsByTagName('A') | select -expandProperty innerText
China
India
Korea
Malaysia
Thailand
Upvotes: 1
Reputation: 21
The following Regex should do the trick:
(?<=><a\shref="\w+">)\w+
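As a rough sketch of how the pattern could be applied, assuming the string from the question is stored in a variable (the names $InputString, $Pattern and $Countries are only illustrative):

```powershell
$InputString = '<p><a href="china">China</a><br><a href="india">India</a><br><a href="korea">Korea</a><br><a href="malaysia">Malaysia</a><br><a href="thailand">Thailand</a></p>'

# The lookbehind requires '><a href="...">' immediately before the text,
# so only the link text itself ends up in each match.
$Pattern = '(?<=><a\shref="\w+">)\w+'

# [regex]::Matches returns every match; take the matched text of each one.
$Countries = [regex]::Matches($InputString, $Pattern) | ForEach-Object { $_.Value }

$Countries
```

Two caveats: \w+ stops at the first space, so a name such as "South Korea" would be cut short, and the lookbehind expects a > directly in front of each <a, which happens to hold for this snippet.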
Upvotes: 0