Reputation: 6228
Hi, I am trying to extract the value of the href attribute from a line of HTML. For example:
<link rel="stylesheet" href="style.css" type="text/css">
I want to get "style.css" or:
<a href="target0.html"><img align="center" src="thumbnails/image001.jpg" width="154" height="99">
I want to get "target0.html"
What would be the correct Java code to do this?
Upvotes: 0
Views: 460
Reputation: 115328
I have not tried the following, but it should be something like this:
Pattern.compile("<(?:link|a)\\s+[^>]*href=\"(.*?)\"")
That said, I'd recommend using one of the available HTML (or even XML) parsers for this task.
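A minimal sketch of how a pattern like the one above could be applied with Matcher, assuming every href value is double-quoted. The class name HrefRegexDemo and the method extractHrefs are made up for illustration, and the backslash is doubled because the pattern sits inside a Java string literal:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are not from the original answer.
class HrefRegexDemo {
    // Matches an <a ...> or <link ...> tag and captures the double-quoted href value.
    // Assumption: href values are always wrapped in double quotes.
    private static final Pattern HREF = Pattern.compile(
            "<(?:link|a)\\s+[^>]*?href=\"(.*?)\"", Pattern.CASE_INSENSITIVE);

    // Returns every captured href value, in document order.
    static List<String> extractHrefs(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<link rel=\"stylesheet\" href=\"style.css\" type=\"text/css\">"
                + "<a href=\"target0.html\"><img src=\"thumbnails/image001.jpg\">";
        System.out.println(extractHrefs(html)); // prints [style.css, target0.html]
    }
}
```

This still inherits the usual weaknesses of regex on HTML (single-quoted values, comments, attributes such as data-href), which is why a parser remains the safer choice.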
Upvotes: 0
Reputation: 28638
I realize you asked about using regular expressions, but jsoup makes this so simple and is much less error prone:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class HrefExtractor {
    public static void main(final String[] args) {
        // Parse the HTML fragment; jsoup tolerates the unclosed <a> and <img> tags.
        final Document document = Jsoup.parse("<a href=\"target0.html\"><img align=\"center\" src=\"thumbnails/image001.jpg\" width=\"154\" height=\"99\">");
        // Select every <a> element that has an href attribute.
        // Use "a[href], link[href]" to pick up <link> elements as well.
        final Elements links = document.select("a[href]");
        for (final Element element : links) {
            System.out.println(element.attr("href"));
        }
    }
}
Upvotes: 1
Reputation: 387
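public static String getHref(String str)
{
    // Search for the attribute together with its opening double quote.
    final String marker = "href=\"";
    int startIndex = str.indexOf(marker);
    if (startIndex < 0)
        return "";
    int valueStart = startIndex + marker.length();
    int valueEnd = str.indexOf("\"", valueStart);
    // No closing quote: report that no usable href was found.
    if (valueEnd < 0)
        return "";
    return str.substring(valueStart, valueEnd);
}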
This method assumes that the HTML is well formed and that the href value is in double quotes, and it only works for the first href in the string, but I'm sure you can extrapolate from here.
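One way the extrapolation to every href in the string might look, as a rough sketch under the same assumptions (well-formed input, double-quoted values). The class name HrefScanner and the method getAllHrefs are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the names are not from the original answer.
class HrefScanner {
    // Collects every double-quoted href value by repeatedly calling indexOf.
    static List<String> getAllHrefs(String str) {
        final String marker = "href=\"";
        List<String> hrefs = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = str.indexOf(marker, from);
            if (start < 0) {
                break; // no further href attributes
            }
            int valueStart = start + marker.length();
            int valueEnd = str.indexOf('"', valueStart);
            if (valueEnd < 0) {
                break; // unterminated value: stop rather than throw
            }
            hrefs.add(str.substring(valueStart, valueEnd));
            from = valueEnd + 1; // continue scanning after the closing quote
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String html = "<link rel=\"stylesheet\" href=\"style.css\" type=\"text/css\">"
                + "<a href=\"target0.html\">";
        System.out.println(getAllHrefs(html)); // prints [style.css, target0.html]
    }
}
```

Unlike the regex or parser approaches, this does not check which tag the attribute belongs to, so it returns the href of any element.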
Upvotes: 1