Justin Kredible

Reputation: 8414

Java Regex - Extract link from HTML anchor

I have the following code:

private String anchorRegex = "\\<\\s*?a\\s+.*?href\\s*?=\\s*?([^\\s]*?).*?\\>";
private Pattern anchorPattern = Pattern.compile(anchorRegex, Pattern.CASE_INSENSITIVE);
String content = getContentAsString();
Matcher matcher = anchorPattern.matcher(content);

while(matcher.find()) {
    System.out.println(matcher.group(1));
}

The call to getContentAsString() returns the HTML content from a web page. The problem I'm having is that the only thing that gets printed in my System.out is a space. Can anyone see what's wrong with my regex?

Regex drives me crazy sometimes.

Upvotes: 1

Views: 2215

Answers (3)

user557597

This should be able to pull out the href without too much trouble.
The link is in capture group 2. The first version is written in expanded (free-spacing) form and assumes dot-all mode.
Escape it for a Java string literal as necessary.

(?s)
<a 
  (?=\s) 
  (?:[^>"']|"[^"]*"|'[^']*')*? (?<=\s) href \s*=\s* (['"]) (.*?) \1 
  (?:".*?"|'.*?'|[^>]*?)+ 
>

Or, not expanded and not relying on dot-all:

<a(?=\s)(?:[^>"']|"[^"]*"|'[^']*')*?(?<=\s)href\s*=\s*(['"])([\s\S]*?)\1(?:"[\s\S]*?"|'[\s\S]*?'|[^>]*?)+>
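
As a rough illustration of the Java escaping, the one-line form above could be embedded in a string literal roughly as follows. This is only a sketch: the class name HrefExtractor, the method extractHrefs, and the sample HTML are made up for the example, and only the pattern itself comes from this answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefExtractor {
    // The one-line pattern from the answer, with backslashes and double quotes
    // escaped for a Java string literal (split in two only for line length).
    private static final Pattern ANCHOR = Pattern.compile(
            "<a(?=\\s)(?:[^>\"']|\"[^\"]*\"|'[^']*')*?(?<=\\s)href\\s*=\\s*(['\"])([\\s\\S]*?)\\1"
            + "(?:\"[\\s\\S]*?\"|'[\\s\\S]*?'|[^>]*?)+>");

    // Hypothetical helper: collects capture group 2 (the href value)
    // from every anchor tag the pattern finds.
    static List<String> extractHrefs(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher matcher = ANCHOR.matcher(html);
        while (matcher.find()) {
            hrefs.add(matcher.group(2));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Made-up sample input with one double-quoted and one single-quoted href.
        String html = "<p><a class='x' href=\"http://example.com/a\">one</a> "
                + "<a href='/b?q=1' title=\"t\">two</a></p>";
        for (String href : extractHrefs(html)) {
            System.out.println(href);
        }
    }
}
```

Group 1 only records which quote character opened the value so that \1 can require the same one to close it, which is why the link itself ends up in group 2.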

Upvotes: 0

anubhava

Reputation: 786031

The regex you should be using is this:

String anchorRegex = "(?s)<\\s*a\\s+.*?href\\s*=\\s*['\"]([^\\s>]*)['\"]";

Upvotes: 1

beerbajay

Reputation: 20300

You need to delimit your capturing group from the following .*?. There are probably double quotes " around the href value, so use those:

<\s*a\s+.*?href\s*=\s*"(\S*?)".*?>

Your regex contains:

([^\s]*?).*?

The ([^\s]*?) says to reluctantly match non-whitespace characters and save them in a group. But how far a reluctant *? extends depends on the next part, which is .*?, and that can match any characters. So the group stops at the first possible chance, capturing nothing, and it is the .*? that matches the rest of the URL.
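
To see the difference side by side, here is a small sketch that runs both patterns over the same input. The class name ReluctantGroupDemo, the helper firstGroups, and the sample HTML are invented for illustration; the two regexes are the one from the question and the quote-delimited one above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReluctantGroupDemo {
    // Pattern from the question: the reluctant group is followed directly by .*?
    static final Pattern ORIGINAL = Pattern.compile(
            "\\<\\s*?a\\s+.*?href\\s*?=\\s*?([^\\s]*?).*?\\>", Pattern.CASE_INSENSITIVE);

    // Quote-delimited pattern: the closing " forces the group to span the URL.
    static final Pattern DELIMITED = Pattern.compile(
            "<\\s*a\\s+.*?href\\s*=\\s*\"(\\S*?)\".*?>", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns group 1 of every match of the pattern.
    static List<String> firstGroups(Pattern pattern, String html) {
        List<String> groups = new ArrayList<>();
        Matcher matcher = pattern.matcher(html);
        while (matcher.find()) {
            groups.add(matcher.group(1));
        }
        return groups;
    }

    public static void main(String[] args) {
        // Made-up sample input with a double-quoted href.
        String html = "<a class=\"x\" href=\"http://example.com/page\">link</a>";
        // Brackets make an empty capture visible in the output.
        for (String g : firstGroups(ORIGINAL, html)) {
            System.out.println("original:  [" + g + "]");
        }
        for (String g : firstGroups(DELIMITED, html)) {
            System.out.println("delimited: [" + g + "]");
        }
    }
}
```

With the original pattern the tag still matches, but group 1 is the empty string, so println emits nothing visible except a line break.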

Upvotes: 1
