Gordon Dugan

Reputation: 129

Use regex to get both a link and the text associated with it (anchor tag)

I wrote a regex that I hoped would capture both the link and its associated text from an HTML page. For instance, given a link such as:

<a href='www.la.com/magic.htm'>magicians of los angeles</a>

Then the link I want is 'www.la.com/magic.htm' and the text I want is 'magicians of los angeles'.

I used the following regular expression:

strsearch = "\<a\s+(.*?)\>(.*?)\</a\s*?\>|"

But my VB program reported that I was getting too many matches. Is there something wrong with the regular expression?

The parentheses are meant to define capture groups that can be back-referenced. Thanks.

Upvotes: 0

Views: 1316

Answers (2)

Ju-Hsien Lai

Reputation: 733

I tried the following pattern, and it worked:

\<a href=(.*?)\>(.*?)\<\/a\s*?\>

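I also found two problems in your original string:

  • the '/' in '/a' is not escaped (regex101 needs this because it uses '/' as the pattern delimiter; .NET itself does not)
  • the first group captures the whole 'href=...' attribute, not just the URL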
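One more thing that may be worth checking is the trailing '|' in the question's pattern: it adds an empty alternative, and an empty alternative matches the empty string at every position, which could explain the "too many matches" report. Below is a minimal sketch of that effect in Python rather than VB (the regex syntax used here behaves the same way in both). The names `find_all`, `WITH_PIPE`, `WITHOUT_PIPE` and `HREF_ONLY` are made up for illustration, and `HREF_ONLY` is my own assumption about how to capture only the quoted URL.

```python
import re

# Pattern from the question, keeping its trailing "|" (an empty alternative).
# The backslashes before < and > are dropped here because they are not needed.
WITH_PIPE = re.compile(r"<a\s+(.*?)>(.*?)</a\s*?>|")

# The same pattern without the trailing "|".
WITHOUT_PIPE = re.compile(r"<a\s+(.*?)>(.*?)</a\s*?>")

# Hypothetical refinement: group 1 is only the quoted href value,
# group 2 is the link text.
HREF_ONLY = re.compile(
    r"<a\s+[^>]*?href=['\"]([^'\"]*)['\"][^>]*>(.*?)</a\s*>",
    re.IGNORECASE | re.DOTALL,
)


def find_all(pattern, html):
    """Return the full text of every match of `pattern` in `html`."""
    return [m.group(0) for m in pattern.finditer(html)]


sample = "see <a href='www.la.com/magic.htm'>magicians of los angeles</a> here"

# With the trailing "|", empty matches show up alongside the real one.
noisy = find_all(WITH_PIPE, sample)
# Without it, only the anchor itself is matched.
clean = find_all(WITHOUT_PIPE, sample)

link, text = HREF_ONLY.search(sample).groups()
print(len(noisy), len(clean), link, text)
```

In VB the two values would be read from the match's groups in the same order (group 1 for the link, group 2 for the text).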

Finally, I would like to recommend a great site for testing regex strings; it makes debugging much faster. See REGEX101, which also demonstrates the result you want.

Upvotes: 0

Veverke

Reputation: 11348

What about this one:

\<a href=.+\</a>

All that is left to do is go over each match and extract the substrings using regular string manipulation.

You can check it on regexr (although regexr follows the JavaScript regex implementation, it is still useful in our scenario).
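As a rough illustration of the "match, then cut the string up" idea, here is a sketch in Python rather than VB. Two assumptions of mine: I use the lazy '.+?' instead of '.+' so that two links on the same line are not swallowed by a single match, and I assume 'href' is the only attribute in the opening tag. The names `ANCHOR`, `split_anchor` and `extract_links` are hypothetical.

```python
import re

# Lazy variant of the pattern above (".+?" instead of ".+"); this is an
# assumption so that each anchor on a line is matched separately.
ANCHOR = re.compile(r"<a href=.+?</a>")


def split_anchor(anchor):
    """Extract (link, text) from one matched anchor using plain string methods."""
    open_end = anchor.index(">")          # end of the opening <a ...> tag
    close_start = anchor.rindex("</a>")   # start of the closing tag
    # Everything between "<a href=" and ">" is the link, possibly quoted.
    link = anchor[len("<a href="):open_end].strip().strip("'\"")
    text = anchor[open_end + 1:close_start]
    return link, text


def extract_links(html):
    """Return a list of (link, text) pairs, one per matched anchor."""
    return [split_anchor(m.group(0)) for m in ANCHOR.finditer(html)]


page = (
    "<a href='www.la.com/magic.htm'>magicians of los angeles</a> and "
    '<a href="b.htm">B</a>'
)
for link, text in extract_links(page):
    print(link, "|", text)
```

If the opening tag carries other attributes after 'href', the slice for the link would include them, so this only holds for simple anchors like the one in the question.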

That being said, I often see people stating that regexes are not suited for parsing HTML, so you might need an HTML parser for this. The ones I know of and can recommend are HtmlAgilityPack, which as far as I know is not maintained anymore, and AngleSharp.
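To give an idea of what the parser route looks like, here is a minimal sketch. It does not use HtmlAgilityPack or AngleSharp; as a stand-in it uses Python's standard-library `html.parser`, so treat it only as an illustration of the approach. `AnchorCollector` and `collect_links` are names I made up.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the anchor currently open, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text while we are inside an anchor with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def collect_links(html):
    """Parse `html` and return a list of (href, text) pairs."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.links


print(collect_links("<a href='www.la.com/magic.htm'>magicians of los angeles</a>"))
```

Unlike the regex versions, this copes with extra attributes, either quote style, and markup nested inside the link text.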

Upvotes: 2
