Reputation: 1148
I have written this regexp: <(a*)\b[^>]*>.*?</\1>
and is tested on this regexp testing site: http://gskinner.com/RegExr/?2tntr
The point of the regexp is to go through a sites HTML and find all of the links. It should then return these in an Array for me to manipulate.
On the regexp testing site it works perfectly, but when put in action with JavaScript on my site it returns null.
JavaScript looks like this:
var data = $('#mainDivOnMiddleOfPage').html();
var pattern = "<(a*).*href=.*>.*</a>";
var modi = "g";
var patt = new RegExp(pattern, modi);
var result = patt.exec(data);
jQuery gets the content of the page. This is tested and verified.
Question is, why does this return null in JavaScript but what it is supposed to return in the regexp tester?
Upvotes: 1
Views: 355
Reputation: 31991
"The point of the regexp is to go through a sites HTML and find all of the links. It should then return these in an Array for me to manipulate."
I won't add another regex answer, but just want to point out that if you have hold of the document (not just the html) then it's easier to walk trhough the links collection. That contains all <a href="">
's but also all <area>
elements:
for (var link, links = document.links, n = links.length, i=0; i<n; i++){
link = links[i];
switch (link.tagName){
case "A":
//do something with the link
break;
case "AREA":
//do something with the area.
break;
}
}
Upvotes: 1
Reputation: 34598
I'm conscious an answer has been chosen. However it's worth mentioning that the current REGEX solutions match the tags but not the actual HREFs in isolation.
This is where JavaScript falls down, since its somewhat simplistic implementation of REGEX does not allow for the capturing of sub-groups when the global g
flag is specified.
One way round this is to exploit the REGEX replacement callback. This will get just the link HREFs, not the tags.
var html = document.body.innerHTML,
links = [];
html.replace(/<a[^>]*?href=('|")(.*?)\1/gi, function($0, $1, $2) {
links.push($2);
});
//links is now an array of hrefs
It also uses a back-reference to close the href
attribute, i.e. making sure both opening and closing quote are single or double, not mixed.
Sidenote: as others have mentioned, where possible, you'd want to DOM this rather than REGEX.
Upvotes: 1
Reputation: 16468
Your problem is that you are not compiling your regex:
patt.compile();
You have to call it before using with the exec()
method.
Upvotes: 0
Reputation: 1418
Going to go ahead and post this here, since I think it's what you want -- it is not a RegEx solution, however.
$(function(){
$.ajax({
url: "test.htm",
success: function(data){
var array_of_links = $.makeArray($("a",data));
// do your stuff here
}
});
});
Upvotes: 1
Reputation: 43703
All <a>
links:
<a[^>]*?\bhref=['\"](.*?)['\"]
Absolute links only (starting with http
):
<a[^>]*?\bhref=['\"](http.*?)['\"]
JavaScript code:
var html = '<a href="test.html">';
var m = html.match(/<a[^>]*?\bhref=['"](.*?)['"]/);
print (m[1]);
See and test the code here.
Upvotes: 1
Reputation: 2681
I use the following code to do the same thing and it works for me, try it out
var data = document.getElementById('mainDivOnMiddleOfPage').textContent;
var result = data.match(/<(a*).*href=.*>.*<\/a>/);
Upvotes: 1