lw0

Reputation: 199

javascript regex for links and links class

I need to collect all links out of text in javascript with regex, separating the actual content of href and the text of the link. So if the link is

<a href="someplace/topics/us/john.htm" class="r_lapi">John Dow</a>

I want to collect the content of href and "John Dow".

The links have class="r_lapi" in them that would identify the links I'm looking for. What I have right now is:

     var link_regex = new RegExp("/<a[^]*</a>/");
     var match = content.match(link_regex, 'i');
     console.log("match =", match );

Which does absolutely nothing. Any help is very much appreciated.

Upvotes: 0

Views: 90

Answers (2)

T.J. Crowder

Reputation: 1074238

If you can use the DOM (you've said you want regex, but...):

var i;
var links = document.querySelectorAll("a.r_lapi");
for (i = 0; i < links.length; ++i) {
    // use `links[i].innerHTML` here
}

You've said in a comment that you're trying to do this with regex because you're receiving the link HTML (presumably mixed with a bunch of other stuff) via ajax. You can use the browser to parse it and then look for the links in the parsed result, without adding the HTML to your document, using a disconnected element:

var div, links, i;

// Create an element; note we don't append it anywhere
div = document.createElement('div');

// Fill it in with the HTML
div.innerHTML = text;

// Find relevant links (same as the earlier example)
links = div.querySelectorAll("a.r_lapi");
for (i = 0; i < links.length; ++i) {
    // use `links[i].innerHTML` here
}
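
Since the question asks for both the href and the link text, here is a minimal sketch of what the loop body might look like. The function name `collectLinksFromDom` and the shape of the returned objects are my own choices for illustration, not anything from the question; `root` is assumed to be anything with `querySelectorAll`, such as the disconnected div above after its `innerHTML` has been set.

```javascript
// Hypothetical helper: collects { href, text } for every a.r_lapi under root.
function collectLinksFromDom(root) {
  var links = root.querySelectorAll("a.r_lapi");
  var results = [];
  var i;
  for (i = 0; i < links.length; ++i) {
    results.push({
      // getAttribute returns the href exactly as written in the markup
      href: links[i].getAttribute("href"),
      // textContent gives the link text without any nested tags
      text: links[i].textContent
    });
  }
  return results;
}
```

In a browser you would call it as `collectLinksFromDom(div)` once the div has been filled in. I've used `getAttribute("href")` rather than the `href` property because the property gives you a resolved absolute URL, which is probably not what you want for a relative path like the one in the question.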

For example, given this text returned via ajax:

<a href="someplace/topics/us/john.htm" class="r_lapi">John Dow</a>
<a href="foo">Don't pick me</a>
<a href="blahblahblah" class="r_lapi">Jane Bloggs</a>

The only real "gotcha" here is that if the HTML contains image tags, the browser will start downloading those images (even though they won't be shown anywhere). This is true even if you use a document fragment, which is part of why I didn't bother with one above. (script tags in the text aren't a problem: they aren't executed when you assign via innerHTML. Beware, though, that they are executed by things like jQuery's html function.)

Or if the data is coming back to you in some other form (like JSON), with the HTML in it, parse the JSON (or whatever) and then run each HTML fragment through the div one at a time:

function handleLinks(data) {
  var div, links, htmlIndex, linkIndex;

  div = document.createElement('div');
  for (htmlIndex = 0; htmlIndex < data.htmlList.length; ++htmlIndex) {
    div.innerHTML = data.htmlList[htmlIndex];
    links = div.querySelectorAll("a.r_lapi");
    for (linkIndex = 0; linkIndex < links.length; ++linkIndex) {
      // Use `links[linkIndex].innerHTML` here
    }
  }
}

For example, given this JSON returned via ajax:

{
    "htmlList": [
        "blah blah <a href=\"someplace/topics/us/john.htm\" class=\"r_lapi\">John Dow</a> blah blah",
        "<a href=\"foo\">Don't pick me</a>",
        "Two in this one <a href=\"blahblahblah\" class=\"r_lapi\">Jane Bloggs</a> and <a href=\"blahblahblah\" class=\"r_lapi\">Trevor Bloggs</a>"
    ]
}

If you really need to use regex:

Beware that you cannot do this reliably with regular expressions in JavaScript; you need a parser.

You can get close with a couple of assumptions.

 var link_regex = /<a(?:>|\s[^>]*>)(.*?)<\/a>/i;
 var match = content.match(link_regex);
 if (match) {
     // Use match[1], which contains the link's inner HTML (everything between <a ...> and </a>)
 }


That looks for this:

  1. The literal text <a
  2. Either a > immediately following, or at least one whitespace character followed by any number of characters that aren't a >, followed by a >
  3. Any number of characters, minimal-match
  4. The literal text </a>

The "minimal match" in Step 3 is so we don't get more than we want if we have <a>first</a><a>second</a>.
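
To make the difference concrete, here is a small sketch comparing a greedy `.*` with the minimal `.*?` on exactly that two-link string. The variable names are just for illustration.

```javascript
// Two adjacent anchors, as in the case described above.
var sample = "<a>first</a><a>second</a>";

// Greedy: .* runs on to the LAST </a>, swallowing both links.
var greedy = sample.match(/<a(?:>|\s[^>]*>)(.*)<\/a>/i);

// Minimal: .*? stops at the FIRST </a>.
var lazy = sample.match(/<a(?:>|\s[^>]*>)(.*?)<\/a>/i);

console.log(greedy[1]); // first</a><a>second
console.log(lazy[1]);   // first
```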

I haven't tried to limit the regex by the class; I'll leave that as an exercise for the reader. :-)
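
For what it's worth, one possible way to approach that exercise is sketched below. It is not the only way, and it leans on even more assumptions than the regex above: attribute values are double-quoted, no attribute value contains a `>`, and anchors aren't nested. The function name `collectLinksWithRegex` and the result shape are made up for this example. It finds each anchor with a global regex, then inspects the attribute text separately for the class and href.

```javascript
// Hypothetical helper: returns { href, text } for each anchor whose class
// list includes "r_lapi". Assumes double-quoted attributes and no ">"
// inside attribute values.
function collectLinksWithRegex(html) {
  var anchorRe = /<a\s([^>]*)>([\s\S]*?)<\/a>/gi; // [1] = attributes, [2] = inner HTML
  var classRe = /(?:^|\s)class="([^"]*)"/i;
  var hrefRe = /(?:^|\s)href="([^"]*)"/i;
  var results = [];
  var m, cls, href;

  while ((m = anchorRe.exec(html)) !== null) {
    cls = classRe.exec(m[1]);
    // Skip anchors with no class attribute or without r_lapi in the list
    if (!cls || cls[1].split(/\s+/).indexOf("r_lapi") === -1) {
      continue;
    }
    href = hrefRe.exec(m[1]);
    results.push({ href: href ? href[1] : null, text: m[2] });
  }
  return results;
}

var sampleHtml =
  '<a href="someplace/topics/us/john.htm" class="r_lapi">John Dow</a>\n' +
  '<a href="foo">Don\'t pick me</a>\n' +
  '<a href="blahblahblah" class="r_lapi">Jane Bloggs</a>';

var found = collectLinksWithRegex(sampleHtml);
console.log(found.length);  // 2
console.log(found[0].href); // someplace/topics/us/john.htm
console.log(found[0].text); // John Dow
```

Note that `text` here is the anchor's inner HTML, so any nested tags would come through as markup rather than plain text.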

Again, though, this is a bad idea. Instead, use the DOM (if you're doing this outside a browser, there are plenty of DOM implementations you can use).

One of the primary assumptions made above is that there is never a > character within an attribute value in the anchor (e.g., <a href="..." data-something="I have a > in me">John Dow</a>). It's perfectly valid to have a > inside an attribute value, so that assumption is invalid.
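
Here is a quick sketch of what that failure looks like in practice, using the regex from above on a made-up anchor (the href value `x.htm` is just a placeholder). The opening tag is considered finished at the first `>`, so the rest of the attribute leaks into the captured text.

```javascript
var link_regex = /<a(?:>|\s[^>]*>)(.*?)<\/a>/i;

// Hypothetical input with a ">" inside an attribute value
var tricky = '<a href="x.htm" data-something="I have a > in me">John Dow</a>';

var trickyMatch = tricky.match(link_regex);
console.log(trickyMatch[1]); // prints: in me">John Dow (with a leading space)
```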

Upvotes: 1

Bart

Reputation: 27205

If you're in a browser, you really should be using the native DOM.

If you're not, assuming the href does not contain weird characters like > or ", you could use the following regex:

var matches = link.match(/^<a\s+[^>]*href="([^"]+)"[^>]*>([^<]*)<\/a>$/);
if (matches) { // match returns null when the link doesn't fit the pattern
    // matches[1] is "someplace/topics/us/john.htm"
    // matches[2] is "John Dow"
}

Please note that this will fail on certain links like

  • <a title=">" href="test">test</a>
  • <a href="test">John <b>Dow</b></a>
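
As a rough check of that kind of failure, the sketch below runs the same regex against the link from the question, the nested-markup case, and a made-up link with a `>` inside another attribute (the `title` attribute is my own example). The variable names are illustrative only.

```javascript
var linkRe = /^<a\s+[^>]*href="([^"]+)"[^>]*>([^<]*)<\/a>$/;

// The link from the question: matches as intended
var ok = '<a href="someplace/topics/us/john.htm" class="r_lapi">John Dow</a>'.match(linkRe);

// Nested markup in the link text: [^<]* cannot get past the <b>
var nested = '<a href="test">John <b>Dow</b></a>'.match(linkRe);

// A ">" in an attribute before href: [^>]* cannot get past it
var gtInAttr = '<a title=">" href="test">test</a>'.match(linkRe);

console.log(ok[1], "|", ok[2]); // someplace/topics/us/john.htm | John Dow
console.log(nested);            // null
console.log(gtInAttr);          // null
```

Because `match` returns null in the failing cases, anything that reads `matches[1]` without checking first will throw.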

For a complete solution, use an HTML parser.

Upvotes: 1
