Alex Mathew

Reputation: 1554

How to find hyperlinks in a web page using Java?

How can I find the number of hyperlinks in a page, and list what they are? I need to do this in plain Java, not with any framework, so using only the java.net.* classes. Is that possible, and how would I do it?
Can you give me a proper example?

I need to get all the links in the page and save them in a database, each link with its domain name.

Upvotes: 0

Views: 9228

Answers (5)

Pulak

Reputation: 1

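    // Matches absolute http/https URLs anywhere in the text, not only inside href attributes.
    Pattern p = Pattern.compile("(https?://([-\\w\\.]+)+(:\\d+)?(/([\\w/_\\.]*(\\?\\S+)?)?)?)");

    // br is assumed to hold the page content (e.g. a StringBuilder); if it is a
    // BufferedReader, read its contents into a String first, since toString() will not.
    Matcher m = p.matcher(br.toString());

    while (m.find()) {
        String url = m.group(0);
        resp.getWriter().print("<a href=\"" + url + "\">" + url + "</a><br/>");
    }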

Upvotes: 0

naikus

Reputation: 24472

You can use the javax.swing.text.html and javax.swing.text.html.parser packages to achieve this:

import java.io.*;
import java.net.URL;
import java.util.Enumeration;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;

public class Test {
   public static void main(String[] args) throws Exception  {
      Reader r = null;

      try   {
         URL u = new URL(args[0]);
         InputStream in = u.openStream();
         r = new InputStreamReader(in);

         ParserDelegator hp = new ParserDelegator();
         hp.parse(r, new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
               // Only anchor (<a>) tags carry hyperlinks.
               if (t == HTML.Tag.A) {
                  // Attributes are keyed by HTML.Attribute constants, so look up href directly.
                  Object href = a.getAttribute(HTML.Attribute.HREF);
                  if (href != null) {
                     System.out.println(href);
                  }
               }
            }
         }, true);
      }finally {
         if(r != null)  {
            r.close();
         }
      }
   }
}

Compile and run it this way:

javac Test.java
java Test http://www.oracle.com/technetwork/java/index.html
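
The href values printed by the program above can be relative (for example "/downloads/" or "../faq.html"), while the question asks for links with the domain name. A minimal sketch of one way to turn them into absolute URLs using only java.net.URI follows. The class name LinkResolver, its methods, and the example.com URLs are hypothetical and only for illustration, and the page URL is assumed to include a path component.

```java
import java.net.URI;

// Hypothetical helper, not part of the answer above.
class LinkResolver {

    // Resolves one href against the URL of the page it was found on.
    // Returns null when the href is not a valid URI reference (e.g. contains spaces).
    static String resolve(String pageUrl, String href) {
        try {
            return URI.create(pageUrl).resolve(href.trim()).toString();
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    // Returns the host (domain name) of an absolute URL, or null if it has none.
    static String host(String absoluteUrl) {
        return absoluteUrl == null ? null : URI.create(absoluteUrl).getHost();
    }

    public static void main(String[] args) {
        String page = "http://www.example.com/docs/index.html";
        String[] hrefs = {"about.html", "/downloads/", "../faq.html", "http://other.example.org/x"};
        for (String href : hrefs) {
            String absolute = resolve(page, href);
            System.out.println(absolute + " (host: " + host(absolute) + ")");
        }
    }
}
```

In the parser callback, each href could be passed through resolve(args[0], href.toString()) before it is printed or stored.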

Upvotes: 5

Impiastro

Reputation: 858

Try using the jsoup library.

Download the project jar and compile this code snippet:

    Document doc = Jsoup.parse(new URL("http://www.bits4beats.it/"), 2000);

    Elements resultLinks = doc.select("a");
    System.out.println("number of links: " + resultLinks.size());
    for (Element link : resultLinks) {
        System.out.println();
        String href = link.attr("href");
        System.out.println("Title: " + link.text());
        System.out.println("Url: " + href);
    }

The code prints the number of link elements in an HTML page, followed by the text and URL of each one.

Upvotes: 5

camickr

Reputation: 324098

Getting Links in an HTML Document

Upvotes: 3

Gopi

Reputation: 10293

The best option is to use an HTML parser library, but if you don't want to use a third-party library, you can try matching with a regular expression using Java's Pattern and Matcher classes from the java.util.regex package.

Example:

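// Lookbehind and lookahead keep only the text between href=" and the closing quote.
// (A leading \b would skip values that start with a non-word character such as "/".)
String regex = "(?<=href=\")[^\"]*(?=\")";
Pattern pattern = Pattern.compile(regex);

// str_YourHtmlHere holds the page HTML as a String.
Matcher m = pattern.matcher(str_YourHtmlHere);
while (m.find()) {
  System.out.println("FOUND: " + m.group());
}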

The example above uses a simple regex that finds every value given in a double-quoted href attribute. You may have to enhance the regex to handle all scenarios correctly, such as an href whose URL is in single quotes.
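
One possible enhancement along those lines, as a rough sketch rather than a complete solution: the pattern below accepts single or double quotes, ignores case, and allows whitespace around the equals sign. The class name HrefExtractor and the sample HTML are made up for illustration. It still matches href on any tag (not only anchors) and does not handle unquoted values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the answer above.
class HrefExtractor {

    // Group 1 captures a double-quoted value, group 2 a single-quoted one.
    private static final Pattern HREF = Pattern.compile(
            "\\bhref\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')",
            Pattern.CASE_INSENSITIVE);

    // Returns every quoted href value found in the given HTML, in document order.
    static List<String> extract(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://example.com/a\">A</a> <A HREF='/b'>B</A>";
        List<String> links = extract(html);
        System.out.println("number of links: " + links.size());
        for (String link : links) {
            System.out.println("FOUND: " + link);
        }
    }
}
```

The size of the returned list also answers the "number of hyperlinks" part of the question.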

Upvotes: 3
