Reputation: 8414
I have the following code:
private String anchorRegex = "\\<\\s*?a\\s+.*?href\\s*?=\\s*?([^\\s]*?).*?\\>";
private Pattern anchorPattern = Pattern.compile(anchorRegex, Pattern.CASE_INSENSITIVE);
String content = getContentAsString();
Matcher matcher = anchorPattern.matcher(content);
while(matcher.find()) {
System.out.println(matcher.group(1));
}
The call to getContentAsString() returns the HTML content from a web page. The problem I'm having is that the only thing that gets printed in my System.out is a space. Can anyone see what's wrong with my regex?
Regex drives me crazy sometimes.
Upvotes: 1
Views: 2215
Reputation:
This should be able to pull out the href without too much trouble.
The link is in capture group 2. The first version is expanded (free-spacing) and assumes dot-all.
Add Java string escaping as necessary.
(?s)
<a
(?=\s)
(?:[^>"']|"[^"]*"|'[^']*')*? (?<=\s) href \s*=\s* (['"]) (.*?) \1
(?:".*?"|'.*?'|[^>]*?)+
>
Or, not expanded and not dot-all:
<a(?=\s)(?:[^>"']|"[^"]*"|'[^']*')*?(?<=\s)href\s*=\s*(['"])([\s\S]*?)\1(?:"[\s\S]*?"|'[\s\S]*?'|[^>]*?)+>
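For what it's worth, here is a rough sketch of how the one-line version might look once escaped for a Java string literal. The class name, method name, and sample HTML are made up for illustration; only the pattern itself comes from the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original answer.
class Group2HrefDemo {
    // The non-expanded pattern from above, with backslashes and double quotes
    // escaped for a Java string literal.
    static final Pattern ANCHOR = Pattern.compile(
            "<a(?=\\s)(?:[^>\"']|\"[^\"]*\"|'[^']*')*?(?<=\\s)href\\s*=\\s*"
            + "(['\"])([\\s\\S]*?)\\1(?:\"[\\s\\S]*?\"|'[\\s\\S]*?'|[^>]*?)+>");

    // Collects capture group 2 (the link) from every match.
    static List<String> hrefs(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            out.add(m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample input: one single-quoted href, one double-quoted href
        // preceded by a data-href attribute that should be skipped.
        String html = "<a class=\"x\" href='http://example.com/a b'>A</a>"
                + "<a data-href=\"no\" href=\"/yes\">B</a>";
        System.out.println(hrefs(html)); // [http://example.com/a b, /yes]
    }
}
```

Group 1 only remembers which quote character opened the value so that \1 can require the same one to close it, which is why the link ends up in group 2.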
Upvotes: 0
Reputation: 786031
The regex you should be using is this:
String anchorRegex = "(?s)<\\s*a\\s+.*?href\\s*=\\s*['\"]([^\\s>]*)['\"]";
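A minimal sketch of how this pattern could be dropped into the loop from the question. The class name, method name, and sample HTML are invented for the example, and CASE_INSENSITIVE is carried over from the question's code rather than from this answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original answer.
class QuotedHrefDemo {
    // The regex from this answer, compiled with the flag used in the question.
    static final Pattern ANCHOR = Pattern.compile(
            "(?s)<\\s*a\\s+.*?href\\s*=\\s*['\"]([^\\s>]*)['\"]",
            Pattern.CASE_INSENSITIVE);

    // Collects capture group 1 (the href value) from every match.
    static List<String> hrefs(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample input: a double-quoted href, then an upper-case tag
        // spread over two lines with a single-quoted href.
        String html = "<a href=\"http://example.com/\">x</a>\n"
                + "<A\n  title='t' HREF='/rel'>y</A>";
        System.out.println(hrefs(html)); // [http://example.com/, /rel]
    }
}
```

Two things to be aware of: the value must be quoted and must not contain whitespace, and because of (?s) the .*? can run past an anchor that has no href and pick up the href of a later tag.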
Upvotes: 1
Reputation: 20300
You need to delimit your capturing group from the following .*?. There are probably double quotes " around the href, so use those:
<\s*a\s+.*?href\s*=\s*"(\S*?)".*?>
Your regex contains:
([^\s]*?).*?
The ([^\s]*?) says to reluctantly find all non-whitespace characters and save them in a group. But how far the reluctant *? goes depends on the next part, which is . (any character). So the group stops at the first possible chance, having captured nothing, and it is the .*? that matches the rest of the URL.
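To see the difference side by side, here is a small sketch that runs both patterns over the same input. The class name, helper method, and sample HTML are made up for the demonstration; the two patterns are the one from the question and the one suggested above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original answer.
class ReluctantGroupDemo {
    // Pattern from the question: the reluctant group is followed directly by .*?
    static final Pattern ORIGINAL = Pattern.compile(
            "\\<\\s*?a\\s+.*?href\\s*?=\\s*?([^\\s]*?).*?\\>",
            Pattern.CASE_INSENSITIVE);

    // Pattern from this answer: the group is delimited by double quotes.
    static final Pattern DELIMITED = Pattern.compile(
            "<\\s*a\\s+.*?href\\s*=\\s*\"(\\S*?)\".*?>",
            Pattern.CASE_INSENSITIVE);

    // Collects capture group 1 from every match of the given pattern.
    static List<String> firstGroups(Pattern p, String html) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(html);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample input with two double-quoted hrefs.
        String html = "<p><a href=\"http://example.com/a\">A</a> "
                + "<a class=\"x\" href=\"/b\">B</a></p>";
        // The group matches zero characters each time, so two empty strings.
        System.out.println(firstGroups(ORIGINAL, html));  // [, ]
        // With the quotes as delimiters the group has to reach the closing quote.
        System.out.println(firstGroups(DELIMITED, html)); // [http://example.com/a, /b]
    }
}
```

This assumes the hrefs are double-quoted and contain no whitespace; single-quoted or unquoted values would need a different delimiter.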
Upvotes: 1