jamiet

Reputation: 12354

Extract multiple occurrences of a substring using PowerShell

Given the following string:

'<p><a href="china">China</a><br><a href="india">India</a><br><a
href="korea">Korea</a><br><a href="malaysia">Malaysia</a><br><a
href="thailand">Thailand</a></p>'

I'd like to use PowerShell to extract all of the countries listed therein. In other words, I want to return @('China','India','Korea','Malaysia','Thailand').

I have tried using regex but can't find the right pattern. For example:

'<p><a href="china">China</a><br><a href="india">India</a><br><a href="korea">Korea</a><br><a href="malaysia">Malaysia</a><br><a href="thailand">Thailand</a></p>'  -match '(<a href="[A-Z a-z]*">[A-Z a-z]*</a>)+'
$matches

Which returns:

Name                           Value
----                           -----
1                              <a href="china">China</a>
0                              <a href="china">China</a>

Any suggestions? Is regex the right approach here?

P.S. Note that the snippet is not well-formed so I can't simply convert it to XML.

Upvotes: 0

Views: 2394

Answers (4)

qdl

Reputation: 11

$InputString='<p><a href="china">China</a><br><a href="india">India</a><br><a href="korea">Korea</a><br><a href="malaysia">Malaysia</a><br><a href="thailand">Thailand</a></p>'
$Pattern='(?<=>)\w+?(?=<)'

([Regex]::Matches($InputString,$Pattern)).Value

China
India
Korea
Malaysia
Thailand

Upvotes: 0

user4003407

Reputation: 22132

The $Matches automatic variable holds the capturing groups of the single match found by the last -match operation, not a list of every match in the string. To get all matches of a pattern, use the Matches method of the [Regex] class:

$InputString='<p><a href="china">China</a><br><a href="india">India</a><br><a href="korea">Korea</a><br><a href="malaysia">Malaysia</a><br><a href="thailand">Thailand</a></p>'
$Pattern='<a href="[A-Z a-z]*">([A-Z a-z]*)</a>'
$Countries=[Regex]::Matches($InputString,$Pattern)|ForEach-Object {$_.Groups[1].Value}
$Countries
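
As a rough sketch of the same idea using only a built-in cmdlet, Select-String with -AllMatches should also return every match instead of stopping at the first. The $Result variable name is just illustrative, and the pattern is the one from above:

```powershell
$InputString = '<p><a href="china">China</a><br><a href="india">India</a><br><a href="korea">Korea</a><br><a href="malaysia">Malaysia</a><br><a href="thailand">Thailand</a></p>'
$Pattern = '<a href="[A-Z a-z]*">([A-Z a-z]*)</a>'

# -AllMatches makes Select-String collect every match in the input string,
# not just the first one ($Result is an illustrative name).
$Result = $InputString | Select-String -Pattern $Pattern -AllMatches

# Each item in .Matches is a regex Match object; group 1 is the link text.
$Countries = $Result.Matches | ForEach-Object { $_.Groups[1].Value }
$Countries
```

Either way, the country name comes from capture group 1 of each match, so the pattern itself does not need to change.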

That said, for parsing HTML you would be better off using an HTML parser, as another answer suggests.

Upvotes: 3

Duncan

Reputation: 95742

Regular expressions are never a good way to handle HTML (though often they are tempting). You can parse the HTML and extract the data you want without using any regex:

PS C:\> $d = '<p><a href="china">China</a><br><a href="india">India</a><br><a
href="korea">Korea</a><br><a href="malaysia">Malaysia</a><br><a
href="thailand">Thailand</a></p>'


PS C:\> $html = New-Object -ComObject "HTMLFile"

PS C:\> $html.IHTMLDocument2_write($d)

PS C:\> $html.getElementsByTagName('A') | select -expandProperty innerText
China
India
Korea
Malaysia
Thailand

Upvotes: 1

Matija Lah

Reputation: 21

The following Regex should do the trick:

(?<=><a\shref="\w+">)\w+
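
A minimal sketch of how this pattern might be applied, assuming the input is held in a variable called $InputString (a name borrowed from the other answers, not part of this one) and that [Regex]::Matches is used to collect all the hits:

```powershell
# $InputString and $Countries are illustrative names, not part of the pattern itself.
$InputString = '<p><a href="china">China</a><br><a href="india">India</a><br><a href="korea">Korea</a><br><a href="malaysia">Malaysia</a><br><a href="thailand">Thailand</a></p>'

# The lookbehind requires the opening <a href="..."> tag (preceded by '>')
# to sit immediately before the word, so only the link text is returned.
# The .NET regex engine allows variable-length lookbehinds like this one.
$Pattern = '(?<=><a\shref="\w+">)\w+'

$Countries = [Regex]::Matches($InputString, $Pattern) | ForEach-Object { $_.Value }
$Countries
```

Because \w does not match spaces, a country name made of several words would only be captured up to the first space.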

ML

Upvotes: 0
