Reputation: 199
I need to collect all links out of text in javascript with regex, separating the actual content of href and the text of the link. So if the link is
<a href="someplace/topics/us/john.htm" class="r_lapi">John Dow</a>
I want to collect the content of href and "John Dow".
The links have class="r_lapi" in them that would identify the links I'm looking for. What I have right now is:
var link_regex = new RegExp("/<a[^]*</a>/");
var match = content.match(link_regex, 'i');
console.log("match =", match );
Which does absolutely nothing. Any help is very much appreciated.
Upvotes: 0
Views: 90
Reputation: 1074238
If you can use the DOM (you've said you want regex, but...):
var i;
var links = document.querySelectorAll("a.r_lapi");
for (i = 0; i < links.length; ++i) {
// use `links[i].innerHTML` here
}
You've said in a comment that you're trying to do this with regex because you're receiving the link HTML (presumably mixed with a bunch of other stuff) via ajax. You can use the browser to parse it and then look for the links in the parsed result, without adding the HTML to your document, using a disconnected element:
var div, links, i;
// Create an element; note we don't append it anywhere
div = document.createElement('div');
// Fill it in with the HTML
div.innerHTML = text;
// Find relevant links (same as the earlier example)
links = div.querySelectorAll("a.r_lapi");
for (i = 0; i < links.length; ++i) {
// use `links[i].innerHTML` here
}
For example, using this text returned via ajax:
<a href="someplace/topics/us/john.htm" class="r_lapi">John Dow</a>
<a href="foo">Don't pick me</a>
<a href="blahblahblah" class="r_lapi">Jane Bloggs</a>
The only real "gotcha" here is that if the HTML contains image tags, the browser will start downloading those images (even though they won't be shown anywhere). This is true even if you use a document fragment, which is part of why I didn't bother above. (script tags in the text aren't a problem: they aren't executed when you use innerHTML, but beware that they are executed by things like jQuery's html function.)
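A minimal sketch of filling in the loop body with what the question asks for (the href and the link text), assuming the disconnected-div approach above. The helper name extractLinks is hypothetical and not part of any library. getAttribute("href") returns the attribute as written, whereas the href property would give the resolved absolute URL, and textContent gives the link text without any nested markup.

```javascript
// Sketch only: `extractLinks` is a hypothetical helper name.
// `root` is any object with a querySelectorAll method, such as the
// disconnected div described above.
function extractLinks(root) {
    var links = root.querySelectorAll("a.r_lapi");
    var results = [];
    var i;
    for (i = 0; i < links.length; ++i) {
        results.push({
            href: links[i].getAttribute("href"), // attribute value exactly as written
            text: links[i].textContent           // link text, nested tags stripped
        });
    }
    return results;
}

// Browser-only usage; skipped where there is no DOM (e.g. plain Node.js)
if (typeof document !== "undefined") {
    var div = document.createElement("div");
    div.innerHTML = '<a href="someplace/topics/us/john.htm" class="r_lapi">John Dow</a>';
    console.log(extractLinks(div));
}
```

Taking the root element as a parameter keeps the helper usable with either the live document or a disconnected element.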
Or if the data is coming back to you in some other form (like JSON), with the HTML in it, parse the JSON (or whatever) and then run each HTML fragment through the div one at a time:
function handleLinks(data) {
var div, links, htmlIndex, linkIndex;
div = document.createElement('div');
for (htmlIndex = 0; htmlIndex < data.htmlList.length; ++htmlIndex) {
div.innerHTML = data.htmlList[htmlIndex];
links = div.querySelectorAll("a.r_lapi");
for (linkIndex = 0; linkIndex < links.length; ++linkIndex) {
// Use `links[linkIndex].innerHTML` here
}
}
}
For example, using this JSON returned via ajax:
{
"htmlList": [
"blah blah <a href=\"someplace/topics/us/john.htm\" class=\"r_lapi\">John Dow</a> blah blah",
"<a href=\"foo\">Don't pick me</a>",
"Two in this one <a href=\"blahblahblah\" class=\"r_lapi\">Jane Bloggs</a> and <a href=\"blahblahblah\" class=\"r_lapi\">Trevor Bloggs</a>"
]
}
If you really need to use regex:
Beware that you cannot do this reliably with regular expressions in JavaScript; you need a parser.
You can get close with a couple of assumptions.
var link_regex = /<a(?:>|\s[^>]*>)(.*?)<\/a>/i;
var match = content.match(link_regex);
if (match) {
// Use match[1], which contains the content between <a ...> and </a> of the first link found
}
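That looks for this:
1. The literal text <a
2. Either a > immediately following, or at least one whitespace character followed by any number of characters that aren't a >, followed by a >
3. Any number of characters (other than line breaks), as a minimal match; this is the captured group
4. The literal text </a>
The "minimal match" in Step 3 is so we don't get more than we want if we have <a>first</a><a>second</a>.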
I haven't tried to limit the regex by the class, I'll leave that as an exercise for the reader. :-)
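One way the class restriction might be sketched, under the same caveats as the regex above plus a few more: attributes are assumed to be double-quoted, no attribute value contains a >, and anchors are not nested. The function name collectLinksByRegex is hypothetical. It finds each anchor with a global regex, then inspects the attribute string separately so that the order of href and class does not matter.

```javascript
// Sketch only: assumes double-quoted attributes, no ">" inside attribute
// values, and no nested <a> elements. `collectLinksByRegex` is a made-up name.
function collectLinksByRegex(text) {
    // Group 1: the attribute string of the opening tag.
    // Group 2: everything up to the nearest </a> (minimal match).
    var linkRegex = /<a\s([^>]*)>([\s\S]*?)<\/a>/gi;
    var classRegex = /(?:^|\s)class="([^"]*)"/i;
    var hrefRegex = /(?:^|\s)href="([^"]*)"/i;
    var results = [];
    var match, classMatch, hrefMatch;
    while ((match = linkRegex.exec(text)) !== null) {
        classMatch = classRegex.exec(match[1]);
        hrefMatch = hrefRegex.exec(match[1]);
        // Keep the link only if "r_lapi" is one of its class names
        if (classMatch && hrefMatch &&
                classMatch[1].split(/\s+/).indexOf("r_lapi") !== -1) {
            results.push({ href: hrefMatch[1], content: match[2] });
        }
    }
    return results;
}

var sample =
    '<a href="someplace/topics/us/john.htm" class="r_lapi">John Dow</a>\n' +
    '<a href="foo">Don\'t pick me</a>\n' +
    '<a href="blahblahblah" class="r_lapi">Jane Bloggs</a>';
console.log(collectLinksByRegex(sample));
```

Note that content is the raw inner HTML of each link, so any nested tags would still be present in it.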
Again, though, this is a bad idea. Instead, use the DOM (if you're doing this outside a browser, there are plenty of DOM implementations you can use).
One of the primary assumptions made above is that there is never a > character within an attribute value in the anchor (e.g., <a href="..." data-something="I have a > in me">John Dow</a>). It's perfectly valid to have a > inside an attribute value, so that assumption is invalid.
Upvotes: 1
Reputation: 27205
If you're in a browser, you really should be using the native DOM.
If you're not, assuming the href does not contain weird characters like > or ", you could use the following regex:
var matches = link.match(/^<a\s+[^>]*href="([^"]+)"[^>]*>([^<]*)<\/a>$/);
matches[1] == "someplace/topics/us/john.htm";
matches[2] == "John Dow";
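Please note that this will fail on certain links. For example, a > inside a later attribute value ends the tag early, so the captured text is wrong, and nested tags in the link text prevent any match:
<a href="test" title=">">test</a>
<a href="test">John <b>Dow</b></a>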
For a complete solution, use an HTML parser.
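To see how the pattern behaves in practice, here is a small check against three inputs. The first is the link from the question; the other two are illustrative inputs with nested markup and with a > in a later attribute value. The variable names are arbitrary.

```javascript
// The pattern from the answer above, unchanged
var linkPattern = /^<a\s+[^>]*href="([^"]+)"[^>]*>([^<]*)<\/a>$/;

// 1. The link from the question: both groups come out as expected
var ok = '<a href="someplace/topics/us/john.htm" class="r_lapi">John Dow</a>'.match(linkPattern);
console.log(ok[1], "|", ok[2]);

// 2. Nested markup in the link text: ([^<]*) stops at <b>, so there is no match
var nested = '<a href="test">John <b>Dow</b></a>'.match(linkPattern);
console.log(nested); // null

// 3. A ">" in a later attribute value: the pattern still matches, but the tag
//    is cut short at that ">", so the text group starts with leftover markup
var cut = '<a href="test" title=">">test</a>'.match(linkPattern);
console.log(cut[2]); // ">test  (leading quote and bracket included)
```

The third case is the more dangerous one, since it returns a result rather than null and the bad text can go unnoticed.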
Upvotes: 1