Reputation: 129
I wrote a regex that I hoped would capture both the link and the associated text from anchors in an HTML page. For instance, if I had a link such as:
<a href='www.la.com/magic.htm'>magicians of los angeles</a>
Then the link I want is 'www.la.com/magic.htm' and the text I want is 'magicians of los angeles'.
I used the following regex:
strsearch = "\<a\s+(.*?)\>(.*?)\</a\s*?\>|"
But my VB program told me I was getting too many matches. Is there something wrong with the regex?
The parentheses are meant to capture groups that can be back-referenced. Thanks
Upvotes: 0
Views: 1316
Reputation: 733
I tried the following pattern, and it worked.
\<a href=(.*?)\>(.*?)\<\/a\s*?\>|
I also found two errors in your original string:
Finally, I would like to recommend a great site for testing regex strings. It will help you debug much faster. See this (it also demonstrates the result you want): REGEX101
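For anyone who wants to check the capture groups outside an online tester, here is a minimal sketch. It uses Python purely for illustration, on the assumption that the .NET engine behind VB treats these constructs the same way. The variable names and the sample sentence are made up. It also compares the question's pattern with and without the trailing `|`: that bar adds an empty alternative, which can match the empty string wherever the anchor part fails, and that may be the source of the "too many matches".

```python
import re

# Sample input: the link from the question with some text around it (illustrative).
html = "See <a href='www.la.com/magic.htm'>magicians of los angeles</a> now"

# The question's pattern as posted, including the trailing "|".
with_pipe = re.compile(r"\<a\s+(.*?)\>(.*?)\</a\s*?\>|")
# The same pattern with the trailing "|" removed.
without_pipe = re.compile(r"\<a\s+(.*?)\>(.*?)\</a\s*?\>")

# findall returns one (group1, group2) tuple per match.
noisy = with_pipe.findall(html)
clean = without_pipe.findall(html)

# Empty matches show up as ('', '') tuples; keep only the real ones.
real_in_noisy = [groups for groups in noisy if any(groups)]

print(len(noisy), "matches with the trailing bar")
print(len(clean), "match without it")
print(clean[0])
```

Note that group 1 here is the whole attribute string (`href='...'`), not just the URL, because the pattern captures everything between `<a ` and `>`.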
Upvotes: 0
Reputation: 11348
What about this one:
\<a href=.+\</a>
All that is left to do is to go over each match and extract the substrings using regular string manipulation.
Check here (although regexr follows the JavaScript regex implementation, it is still useful in our scenario)
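A rough sketch of that "match, then slice" idea, written in Python only for illustration (the same steps should carry over to VB string functions). The helper name `split_anchor` and the sample page are hypothetical. One assumption to flag: because `.+` is greedy, this pattern only isolates a single anchor when there is at most one link per line.

```python
import re

# Pattern from this answer: matches a whole anchor element on one line.
anchor_pattern = re.compile(r"\<a href=.+\</a>")

def split_anchor(anchor):
    """Split one matched anchor into (href, text) with plain string operations."""
    open_end = anchor.index(">")          # end of the opening <a ...> tag
    close_start = anchor.rindex("</a>")   # start of the closing tag
    # Drop the leading "<a href=" and any surrounding quotes.
    href = anchor[len("<a href="):open_end].strip("'\"")
    text = anchor[open_end + 1:close_start]
    return href, text

# Illustrative input with one link per line.
page = (
    "<p><a href='www.la.com/magic.htm'>magicians of los angeles</a></p>\n"
    "<p><a href='www.ny.com/jazz.htm'>jazz in new york</a></p>"
)

links = [split_anchor(m.group(0)) for m in anchor_pattern.finditer(page)]
for href, text in links:
    print(href, "|", text)
```

If two links can share a line, the greedy `.+` will swallow both into one match, so a lazy `.+?` would probably be needed instead.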
That said, I often see people stating that regexes are not suited to parsing HTML. You might need to use an HTML parser for this. The ones I know of and can recommend are HtmlAgilityPack, which is not maintained anymore, and AngleSharp.
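To show what the parser route looks like in practice, here is a small sketch. It does not use HtmlAgilityPack or AngleSharp. It swaps in Python's standard-library `html.parser` as a stand-in, so treat it as an illustration of the approach rather than of either library's API. The class name `LinkCollector` and the helper `extract_links` are made up for this example.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the anchor currently open, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs; anchors without href are skipped.
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None

def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links

sample = "<p>See <a href='www.la.com/magic.htm'>magicians of <b>los angeles</b></a></p>"
result = extract_links(sample)
print(result)
```

Unlike the regex versions, this copes with nested tags inside the link text and with either quote style around the attribute value.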
Upvotes: 2