Reputation: 492
<a href="http://www.google.com">Google</a><br/>
I'm trying to extract the link http://www.google.com as well as the text Google from this tag.
Upvotes: 2
Views: 82
Reputation: 626
I use the following helper method in my web crawler to pull the href value out of a line of HTML. Note that it returns only the link, not the link text.
Here is the code:
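public static String filterHref( String hrefLine )
{
    // No href attribute at all (case-insensitive check): nothing to extract
    if ( !hrefLine.toLowerCase().contains( "href" ) )
        return "";
    // Split case-insensitively so "HREF" cannot pass the check above and then fail here;
    // the limit of 2 keeps a trailing empty element if the line ends right after "href"
    String[] hrefSplit = hrefLine.split( "(?i)href", 2 ); // split href="..." alt="...">...<...>
    String link = hrefSplit[ 1 ].split( "\\s+" )[ 0 ]; // get href attribute and value
    if ( link.contains( ">" ) )
        link = link.substring( 0, link.indexOf( ">" ) ); // drop everything after the tag closes
    link = link.replaceFirst( "=", "" ); // drop the '=' between attribute name and value
    link = link.replace( "\"", "" ).replace( "'", "" ).trim(); // strip quotes
    return link;
}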
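filterHref above returns only the link, while the question also asks for the link text. A minimal sketch of a companion helper follows; the names AnchorText, filterText and indexOfIgnoreCase are hypothetical and not part of the crawler code above. It assumes the line contains at most one anchor and does not distinguish <a from other tags that start with "a", such as <abbr>.

```java
public class AnchorText {

    // Hypothetical helper: case-insensitive indexOf, so "<A" and "</A>" are found too.
    private static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    /** Returns the text between the first opening a-tag and its closing tag, or "" if none is found. */
    public static String filterText(String hrefLine) {
        int open = indexOfIgnoreCase(hrefLine, "<a", 0);
        if (open < 0) return "";
        int start = hrefLine.indexOf('>', open);               // end of the opening tag
        if (start < 0) return "";
        int end = indexOfIgnoreCase(hrefLine, "</a", start);   // start of the closing tag
        if (end < 0) return "";
        return hrefLine.substring(start + 1, end).trim();
    }

    public static void main(String[] args) {
        System.out.println(filterText("<a href=\"http://www.google.com\">Google</a><br/>"));
    }
}
```

Calling filterHref and filterText on the same line would then give the two values the question asks for.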
Upvotes: 0
Reputation: 1140
You can extract both values with a simple regex: group 1 captures the link and group 2 captures the text. Try this.
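import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "<a href=\"http://www.google.com\">Google</a><br/>";
// Group 1: everything inside the double quotes after href=; group 2: the text up to </a>
Pattern pattern = Pattern.compile("<a\\s+href=\"([^\"]*)\">([^<]*)</a>");
Matcher matcher = pattern.matcher(s);
if (matcher.find()) {
    System.out.println(matcher.group(1)); // the link
    System.out.println(matcher.group(2)); // the text
}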
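The pattern above expects href to be the only attribute, written in lowercase with double quotes. If the input is less uniform, a looser pattern may help. The following is only a sketch under those assumptions, not a replacement for a real HTML parser; the class name AnchorExtractor and the method extractLinks are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchorExtractor {

    // Allows other attributes around href, single or double quotes, and any letter case.
    // Group 1: the quote character, group 2: the link, group 3: the text.
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns one {link, text} pair per anchor found, in the order they appear. */
    public static List<String[]> extractLinks(String html) {
        List<String[]> links = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {   // loop instead of a single find() to collect every anchor
            links.add(new String[] { m.group(2).trim(), m.group(3).trim() });
        }
        return links;
    }

    public static void main(String[] args) {
        String s = "<a href=\"http://www.google.com\">Google</a><br/>";
        for (String[] link : extractLinks(s)) {
            System.out.println(link[0] + " -> " + link[1]);
        }
    }
}
```

Group 3 captures whatever sits between the tags, so nested markup such as <b>Google</b> would come back as-is rather than as plain text.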
Upvotes: 0
Reputation: 3651
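This should do the job, as long as href is the first quoted attribute in the tag and uses double quotes.
String html = "<a href=\"http://www.google.com\">Google</a><br/>";
// Splitting on the double quote gives: [ <a href= , http://www.google.com , >Google</a><br/> ]
String[] separate = html.split("\"");
String link = separate[1];
// Skip the leading '>' and keep everything up to the next '<'
String text = separate[2].substring(1).split("<")[0];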
Upvotes: 1