loveforfire33

Reputation: 1110

Get URL of hyperlink based on anchortext

I'm trying to get the URL of every link whose anchor text contains the word "blog".

For example:

<a href="http://asdas.com/blog">this is our blog</a>
<a href="http://asdas.com/blog">BLOG</a>
<a href="http://asdas.com/blog">   blogging   </a>

result: http://asdas.com/blog

This works fine unless the link has more attributes after the href:

<a class="asdadasd" href="http://asdas.com/blog" id="asdasd">this is our blog</a>

Result: http://asdas.com/blog" id="asdasd

Here's what I've got:

(?i)<a.+href="(.*)".*>.*?blog.*?</a>

Upvotes: 0

Views: 41

Answers (2)

revo

Reputation: 48711

Doing this with RegEx alone is a headache. Don't parse HTML documents with RegEx; use DOMParser() instead:

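// Sample markup; the third link's anchor text does not contain "blog".
var html = `<a href="http://asdas.com/blog">this is our blog</a>
<a href="http://asdas.com/blog">BLOG</a>
<a href="http://asdas.com/blog">   test   </a>`;

// Parse the string into a detached document instead of pattern-matching the markup.
var doc = (new DOMParser()).parseFromString(html, 'text/html');
var aTags = doc.getElementsByTagName('a');

// getElementsByTagName returns an HTMLCollection, so copy it into an array first.
Array.prototype.slice.call(aTags).forEach(function(a) {
   // textContent does not depend on the element being rendered, unlike innerText.
   if (/blog/i.test(a.textContent))
     console.log(a.href);
});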

Upvotes: 0

RToyo

Reputation: 2877

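You will need to add ? to make your (.*) lazy. Otherwise the greedy .* grabs as much as it can and only gives characters back until it reaches the last " that still lets the rest of the pattern match. In your example, that is the quote closing the id attribute.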

Try this:

(?i)<a.+href="(.*?)".*>.*?blog.*?</a>

All I've done is change (.*) to (.*?).
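
To see the difference side by side, here is a minimal JavaScript sketch. It assumes a JavaScript regex engine, which does not accept the inline (?i), so the i flag is used instead, with g added so that matchAll can return every match. The sample string and the captureHrefs helper are made up for illustration.

```javascript
// Hypothetical sample input: one link per line, the second with extra attributes.
const sample = [
  '<a href="http://asdas.com/blog">BLOG</a>',
  '<a class="asdadasd" href="http://asdas.com/blog" id="asdasd">this is our blog</a>',
].join('\n');

// The question's pattern (greedy capture) and this answer's pattern (lazy capture).
// The inline (?i) is expressed as the "i" flag; "g" lets matchAll find every link.
const greedy = /<a.+href="(.*)".*>.*?blog.*?<\/a>/gi;
const lazy = /<a.+href="(.*?)".*>.*?blog.*?<\/a>/gi;

// Illustrative helper: collect capture group 1 from every match in the text.
function captureHrefs(pattern, text) {
  return Array.from(text.matchAll(pattern), (m) => m[1]);
}

const greedyResult = captureHrefs(greedy, sample);
const lazyResult = captureHrefs(lazy, sample);

console.log(greedyResult); // second entry runs on into the id attribute
console.log(lazyResult);   // both entries stop at the quote that closes href
```

Note that the .+ before href and the .* before > are still greedy, so if several links share one line the match may still span more than one tag. With one link per line, as in the question, the lazy capture is enough.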

Upvotes: 1
