smartcode

Reputation: 247

How to get web page title using html parser

How can I get the title of a web page for a given URL using an HTML parser? Is it possible to get the title using regular expressions? I would prefer to use an HTML parser.

I am working in the Java Eclipse IDE.

I have tried using the following code, but was unsuccessful.

Any ideas?

Thanks in advance!

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.tags.TitleTag;

public class TestHtml {

    public static void main(String... args) {
        Parser parser = new Parser();
        try {
            parser.setResource("http://www.yahoo.com/");
            NodeList list = parser.parse(null);
            Node node = list.elementAt(0);

            if (node instanceof TitleTag) {
                TitleTag title = (TitleTag) node;
                System.out.println(title.getText());
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
    }
}

Upvotes: 0

Views: 5423

Answers (5)

GPU..

Reputation: 175

This is very easy with HtmlAgilityPack; you only need to get the response of the HTTP request as a string.

String response = httpRequest.getResponseString(); // this may need a few changes, or none
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(response);
HtmlNode node = doc.DocumentNode.SelectSingleNode("//title"); // finds the first <title> element anywhere in the HTML document
String title = node.InnerText; // gives you the title of the page

For example, given <title>helloWorld</title>, node.InnerText contains helloWorld.

OR

String response = httpRequest.getResponseString(); // this may need a few changes, or none
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(response);

HtmlNode head = doc.DocumentNode.SelectSingleNode("//head"); // extra step: get <head>, which is a single node in the HTML
HtmlNode node = head.SelectSingleNode("title"); // then get <title> from the children of <head>

String title = node.InnerText; // gives you the title of the page

Upvotes: 0

madhurtanwani

Reputation: 1219

BTW, HTMLParser already ships with a very simple title extractor. You can use that: http://htmlparser.sourceforge.net/samples.html

To run it from within the HTMLParser code base:

bin/parser http://website_url TITLE

or run

java -jar <path to htmlparser.jar> http://website_url TITLE

or from your code call the method

org.htmlparser.Parser.main(String[] args)

with the parameters new String[] {"<website url>", "TITLE"}

Upvotes: 1

Vinze

Reputation: 2539

According to your (revised) question, the problem is that you only check the first node, Node node = list.elementAt(0);, whereas you should iterate over the list to find the title (which is not the first node). You could also pass a NodeFilter to parse() so that it returns only the TitleTag; the title would then be the first element and you wouldn't have to iterate.

Upvotes: 3

Vinze

Reputation: 2539

Well, assuming you're using Java (though most languages have an equivalent), you can use a SAX parser (such as TagSoup, which transforms any HTML into XHTML), and in your handler you can do:

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;

public class MyHandler extends org.xml.sax.helpers.DefaultHandler {
    boolean readTitle = false; // true while the parser is inside <title>
    StringBuilder title = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String name,
                Attributes attributes) throws SAXException {
        if (localName.equals("title")) {
            readTitle = true;
        }
    }

    @Override
    public void endElement(String uri, String localName, String name)
            throws SAXException {
        if (localName.equals("title")) {
            readTitle = false;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        // characters() may be called several times for one text node, so append
        if (readTitle) title.append(ch, start, length);
    }
}

Then use it with your parser (example with TagSoup):

org.ccil.cowan.tagsoup.Parser parser = new org.ccil.cowan.tagsoup.Parser();
MyHandler handler = new MyHandler();
parser.setContentHandler(handler);
parser.parse(new org.xml.sax.InputSource(inputStream)); // inputStream: an InputStream for your HTML
return handler.title.toString();
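
If you want to try the handler idea without adding TagSoup to the classpath, here is a minimal self-contained sketch that swaps in the SAX parser bundled with the JDK. That parser only accepts well-formed XML, so it assumes the page is already valid XHTML; for real-world HTML you would still want TagSoup or similar. The class name SaxTitleExtractor and the method extractTitle are made up for this illustration. The JDK parser is not namespace-aware by default, so this version matches on the qualified name instead of localName.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects the text found inside <title>.
class SaxTitleExtractor extends DefaultHandler {
    private boolean inTitle = false;
    private final StringBuilder title = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        // The default JDK parser is not namespace-aware, so qName carries the tag name.
        if ("title".equalsIgnoreCase(qName)) {
            inTitle = true;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equalsIgnoreCase(qName)) {
            inTitle = false;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called more than once per text node, so append each chunk.
        if (inTitle) {
            title.append(ch, start, length);
        }
    }

    // Assumes the input is well-formed XHTML without an external DOCTYPE.
    static String extractTitle(String xhtml) {
        SaxTitleExtractor handler = new SaxTitleExtractor();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
        return handler.title.toString().trim();
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Hello World</title></head>"
                + "<body><p>text</p></body></html>";
        System.out.println(extractTitle(page)); // prints "Hello World"
    }
}
```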

Upvotes: 1

Borealid

Reputation: 98469

RegEx match open tags except XHTML self-contained tags

It's smart that you don't want to use a regex.

To use an HTML parser, we need to know which language you're using. Since you say you're "on eclipse", I'm going to assume Java.

Take a look at http://www.ibm.com/developerworks/xml/library/x-domjava/ for a description, overview, and various viewpoints.
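
As a rough illustration of the DOM approach, here is a minimal sketch using the DOM parser that ships with the JDK. It assumes the input is well-formed XHTML (the built-in parser rejects ordinary tag-soup HTML), and the class name DomTitleExtractor and the method extractTitle are invented for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: reads the first <title> element from a DOM tree.
class DomTitleExtractor {

    // Returns the trimmed title text, or null if there is no <title> element.
    // Assumes the input is well-formed XHTML without an external DOCTYPE.
    static String extractTitle(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            // Searches the whole document, not just the first node.
            NodeList titles = doc.getElementsByTagName("title");
            if (titles.getLength() == 0) {
                return null;
            }
            return titles.item(0).getTextContent().trim();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Hello World</title></head>"
                + "<body><p>text</p></body></html>";
        System.out.println(extractTitle(page)); // prints "Hello World"
    }
}
```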

Upvotes: 0
