Ali

Reputation: 1879

Parsing href out of an HTML document with XPath throws a NullPointerException

I want to extract the URLs found at a specific location on a website, so I wrote a simple Java program for it. The program throws a NullPointerException, and it seems that getNamedItem("href") returns null. I suspect I am using getNamedItem incorrectly to extract the URL from the href attribute.

DocumentBuilder b = DocumentBuilderFactory.newInstance().newDocumentBuilder();
org.w3c.dom.Document doc = b.parse(new FileInputStream("clean.html"));

//Evaluate XPath against Document itself
javax.xml.xpath.XPath xPath = XPathFactory.newInstance().newXPath();
NodeList nodes = (NodeList)xPath.evaluate(".//*[@class='r_news_box']",
        doc.getDocumentElement(), XPathConstants.NODESET);
for (int i = 0; i < nodes.getLength(); ++i) {
    Element e = (Element) nodes.item(i);
    System.out.println(e.getAttributes().getNamedItem("href").getTextContent());
}

P.S. Here is one of the nodes that this XPath should select:

<div class="r_news_box">
<a class="picLink" target="_blank" href="/fa/news/427583/test">
<img class="r_news_img" width="50" height="65" src="/files/fa/news/1393/5/29/411217_553.jpg" alt="test"/>
</a>

Upvotes: 0

Views: 283

Answers (2)

har07

Reputation: 89325

Possibly because not all of the selected nodes have an href attribute; getNamedItem("href") returns null for a node without one. You may want to change your XPath so that only elements with an href attribute are returned:

.//*[@class='r_news_box' and @href]

UPDATE:

According to your update, href is an attribute of the <a> element inside the element whose class attribute equals r_news_box, so here is the corrected XPath:

.//*[@class='r_news_box']/a[@href]
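
For reference, a minimal sketch of how that corrected XPath might be wired into the code from the question. The class name HrefExtractor and the helper extractHrefs are made up for illustration, and the sample markup is the snippet from the question with the closing </div> added and an html/body wrapper assumed around it. It also assumes the input is well-formed XML.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper class, not part of the original question.
class HrefExtractor {

    // Collects the href of every <a> that is a direct child of an element
    // whose class attribute is exactly "r_news_box".
    static List<String> extractHrefs(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        XPath xPath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xPath.evaluate(
                ".//*[@class='r_news_box']/a[@href]",
                doc.getDocumentElement(), XPathConstants.NODESET);

        List<String> hrefs = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element a = (Element) nodes.item(i);
            // getAttribute returns "" rather than null when the attribute is missing
            hrefs.add(a.getAttribute("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample input: the node from the question inside an html/body wrapper.
        String xml = "<html><body>"
                + "<div class=\"r_news_box\">"
                + "<a class=\"picLink\" target=\"_blank\" href=\"/fa/news/427583/test\">"
                + "<img class=\"r_news_img\" width=\"50\" height=\"65\" alt=\"test\""
                + " src=\"/files/fa/news/1393/5/29/411217_553.jpg\"/>"
                + "</a>"
                + "</div>"
                + "</body></html>";
        System.out.println(extractHrefs(xml)); // prints [/fa/news/427583/test]
    }
}
```

Because the predicate [@href] already filters out anchors without the attribute, the loop never sees a node for which the lookup would come back empty.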

Upvotes: 1

Lars

Reputation: 1750

Parsing HTML with an XML parser library is not a good idea, because most HTML pages are not valid XML documents. A dedicated HTML parser such as jsoup is a better fit: it is easy to use and largely self-explanatory.
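
To illustrate the point about HTML not being valid XML, here is a minimal sketch that uses only the JDK's XML parser. The class name XmlStrictnessDemo and the helper isWellFormedXml are invented for this illustration. It feeds the parser a fragment with an unclosed <br>, which browsers accept but an XML parser rejects.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class, not taken from the original answer.
class XmlStrictnessDemo {

    // Returns true if the markup parses as well-formed XML, false if the
    // parser rejects it.
    static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Raised for markup that is not well-formed, e.g. an unclosed tag
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Typical HTML: <br> is never closed, so the XML parser gives up.
        System.out.println(isWellFormedXml("<p>one<br>two</p>"));
        // The same content written as XML with a self-closing tag is accepted.
        System.out.println(isWellFormedXml("<p>one<br/>two</p>"));
    }
}
```

The default error handler may also print a "[Fatal Error]" line to stderr for the first input. Real pages tend to contain many such constructs, which is why a parser built for HTML tends to be more robust here.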

Upvotes: 0
