Reputation: 1110
I'm trying to get the URL of every link whose anchor text contains the word "blog".
For example:
<a href="http://asdas.com/blog">this is our blog</a>
<a href="http://asdas.com/blog">BLOG</a>
<a href="http://asdas.com/blog"> blogging </a>
result: http://asdas.com/blog
This works fine unless the tag has more attributes after the href, for example:
<a class="asdadasd" href="http://asdas.com/blog" id="asdasd">this is our blog</a>
Result: http://asdas.com/blog" id="asdasd
Here's what I've got:
(?i)<a.+href="(.*)".*>.*?blog.*?</a>
Upvotes: 0
Views: 41
Reputation: 48711
Using regex alone is a headache. Never parse HTML documents with regex. Do it with DOMParser() instead:
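// Sample markup; the third link's text does not contain "blog", so it is skipped.
var html = `<a href="http://asdas.com/blog">this is our blog</a>
<a href="http://asdas.com/blog">BLOG</a>
<a href="http://asdas.com/blog"> test </a>`;
// Parse the string into a detached document (browser API).
var doc = new DOMParser().parseFromString(html, 'text/html');
var aTags = doc.getElementsByTagName('a');
Array.prototype.slice.call(aTags).forEach(function (a) {
  // textContent also works on nodes that are not rendered; match "blog" case-insensitively.
  if (/blog/i.test(a.textContent)) {
    console.log(a.href);
  }
});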
Upvotes: 0
Reputation: 2877
You will need to use ? to make your (.*) lazy. Otherwise your .* will continue to grab everything that it can until it reaches the final closing ".
Try this:
(?i)<a.+href="(.*?)".*>.*?blog.*?</a>
All I've done is change (.*) to (.*?).
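To see the difference, here is a minimal sketch that runs both patterns against the tag from the question. It assumes JavaScript, where the inline (?i) modifier is generally not available, so the i flag stands in for it; the variable names are only for illustration.

```javascript
// The tag from the question, with extra attributes around href.
const tag = '<a class="asdadasd" href="http://asdas.com/blog" id="asdasd">this is our blog</a>';

// Assumption: the i flag replaces the inline (?i) from the original pattern.
const greedy = /<a.+href="(.*)".*>.*?blog.*?<\/a>/i;   // original, greedy capture
const lazy = /<a.+href="(.*?)".*>.*?blog.*?<\/a>/i;    // fixed, lazy capture

// Group 1 holds whatever was captured between href=" and the closing quote.
const greedyUrl = tag.match(greedy)[1];
const lazyUrl = tag.match(lazy)[1];

console.log(greedyUrl); // http://asdas.com/blog" id="asdasd
console.log(lazyUrl);   // http://asdas.com/blog
```

The greedy group runs to the last " that still lets the rest of the pattern match, which is why id="asdasd leaks into the result. The lazy group stops at the first one.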
Upvotes: 1