Reputation: 1554
How can I find the number of hyperlinks in a page, and how can I find out what they all are?
I need to do this in plain Java, not with any framework, i.e. using only the java.net.* classes. Is that possible, and how can I do it?
Can you give me a proper example?
I need to get all the links in the page and save them to a database, each link with its domain name.
Upvotes: 0
Views: 9228
Reputation: 1
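// Matches absolute http/https URLs anywhere in the text, not only inside href attributes.
Pattern p = Pattern.compile("(https?://([-\\w\\.]+)+(:\\d+)?(/([\\w/_\\.]*(\\?\\S+)?)?)?)");
// Assumes br holds the page text (e.g. a StringBuilder); toString() on a BufferedReader does not return its contents.
Matcher m = p.matcher(br.toString());
while (m.find()) {
    String url = m.group(0);
    resp.getWriter().print("<a href=\"" + url + "\">" + url + "</a><br/>");
}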
Upvotes: 0
Reputation: 24472
You can use the javax.swing.text.html and javax.swing.text.html.parser packages to achieve this:
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class Test {
    public static void main(String[] args) throws Exception {
        // The page URL is passed as the first command-line argument.
        URL u = new URL(args[0]);
        // try-with-resources closes the reader (and the underlying stream) automatically.
        try (Reader r = new InputStreamReader(u.openStream())) {
            ParserDelegator hp = new ParserDelegator();
            hp.parse(r, new HTMLEditorKit.ParserCallback() {
                @Override
                public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                    // Only <a> tags are of interest; print the href attribute if present.
                    if (t == HTML.Tag.A) {
                        Object href = a.getAttribute(HTML.Attribute.HREF);
                        if (href != null) {
                            System.out.println(href);
                        }
                    }
                }
            }, true); // true: ignore the charset declared in the document
        }
    }
}
Compile and run it this way:
javac Test.java
java Test http://www.oracle.com/technetwork/java/index.html
Upvotes: 5
Reputation: 858
Try using the jsoup library.
Download the project jar and compile this code snippet:
Document doc = Jsoup.parse(new URL("http://www.bits4beats.it/"), 2000);
Elements resultLinks = doc.select("a");
System.out.println("number of links: " + resultLinks.size());
for (Element link : resultLinks) {
System.out.println();
String href = link.attr("href");
System.out.println("Title: " + link.text());
System.out.println("Url: " + href);
}
The code prints the number of hyperlink (a) elements in an HTML page, followed by the text and URL of each one.
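Since the question asks for each link together with its domain name, relative href values have to be resolved against the page URL before the host can be read. Here is a minimal sketch of that step using only java.net.URI; the class name LinkDomains and the helper names toAbsolute and domainOf are made up for illustration, and the database step is left out.

```java
import java.net.URI;

class LinkDomains {
    // Resolves an href (relative or absolute) against the URL of the page it came from.
    // URI.create throws IllegalArgumentException for hrefs that are not valid URIs
    // (for example, ones containing spaces), so real pages may need extra handling.
    static URI toAbsolute(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href.trim());
    }

    // Returns the host part of the resolved link, or null when there is none
    // (for example, for mailto: links).
    static String domainOf(String pageUrl, String href) {
        return toAbsolute(pageUrl, href).getHost();
    }

    public static void main(String[] args) {
        String page = "http://www.bits4beats.it/";
        String[] hrefs = { "/about", "https://example.com/x", "mailto:someone@example.com" };
        for (String href : hrefs) {
            System.out.println(toAbsolute(page, href) + " -> " + domainOf(page, href));
        }
    }
}
```

The href strings can come from any of the approaches in this thread; only the resolved URL and host would then need to be written to the database.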
Upvotes: 5
Reputation: 10293
The best option is to use an HTML parser library, but if you don't want to use any third-party library you can try matching with a regular expression, using Java's Pattern and Matcher classes from the java.util.regex package.
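Example:
// Matches the value of every double-quoted href attribute.
// (No leading \b: it would reject values starting with a non-word character such as "/" or "#".)
String regex = "(?<=href=\")[^\"]*(?=\")";
Pattern pattern = Pattern.compile(regex);
Matcher m = pattern.matcher(str_YourHtmlHere);
while (m.find()) {
    System.out.println("FOUND: " + m.group());
}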
The example above uses a simple regex that finds every link given as a double-quoted href attribute. You may have to extend the regex to handle all scenarios correctly, such as an href whose URL is in single quotes.
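As one possible way to cover the single-quote case and also count the links, here is a rough sketch. The class name HrefExtractor and the method extractHrefs are made up for illustration. It captures the opening quote and requires the same quote to close the value; it still misses unquoted href values and also picks up href attributes on tags other than a (such as link).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefExtractor {
    // href = "..." or href = '...'; group 1 is the quote character,
    // group 2 is the attribute value between the matching quotes.
    private static final Pattern HREF = Pattern.compile(
            "href\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    // Returns every quoted href value found in the given HTML text, in order.
    static List<String> extractHrefs(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://example.com/a\">A</a> <a HREF='/b'>B</a>";
        List<String> links = extractHrefs(html);
        System.out.println("number of links: " + links.size());
        for (String link : links) {
            System.out.println("FOUND: " + link);
        }
    }
}
```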
Upvotes: 3