XPath: Help in locating a specific element in a DOM scraped using HTMLUnit

Question

I am scraping a webpage using HTMLUnit and have collected a List of DOM nodes from the webpage.

Inside each of these "company" DOM nodes is some data I want to scrape. For example I want the telephone number text from inside this node:

Now, this element would be a child of a div element which is in turn a child of another div element inside the company node. What would be the correct XPath line to access it? Here is my latest attempt which returned nothing.

List<DomNode> companies = (List<DomNode>) page.getByXPath("//li[@class='featured block twoblock    boxshadow']");
        for (int j = 0; j < companies.size(); j++) {

            DomNode company = companies.get(j);

            // retrieve telephone number
            DomNode telephone = (DomNode) company.getByXPath(
                    "//li[@data-pvd-p='"+j+1+"']/div[@class='listingWrapper']/div[@class='itemInfo']/span[@class='tel']").get(0);

        }

Here is a sample of the HTML:

UPDATE 2:

Here is the current state of my XPath code.

List<DomNode> companies = (List<DomNode>) page
                .getByXPath("//li[contains(@class, 'featured block')]");
        for (int j = 0; j < companies.size(); j++) {

            String url = "";
            DomNode company = companies.get(j);
            DomElement web = null;

            // retrieve name
            DomNode name = (DomNode) company.getByXPath("//a[@itemprop='name']").get(j);

            if (companiesLogged.contains(name.getTextContent().trim()) != true) {
                companiesLogged.add(name.getTextContent().trim());

                // retrieve telephone number
                DomNode telephone = (DomNode) company.getByXPath("div[@class='listingWrapper']/div[@class='itemInfo']/span[@class='tel']").get(0);


                // retrieve website
                try {
                    web = (DomElement) company.getByXPath("div[@class='listingWrapper']/div[@class='itemInfo']" +
                            "/div[@class='listLinks']/a[@id='linkWebsite']").get(0);
                } catch (IndexOutOfBoundsException e) {
                    System.out.print(" (No Website) ");
                }

                try {
                    url = web.getAttribute("href");
                } catch (IndexOutOfBoundsException e) {
                    url = "N/A";
                }

                System.out.println(name.getTextContent().trim() + "   "
                        + telephone.getTextContent().trim()
                 +"   "+url.trim());

            } else {
                System.out.println("Company already logged");
            }
        }

JWiley · Accepted Answer

The first thing I see is how you're retrieving the group of <li> nodes. Just looking at your @class attribute, you can't really tell how many spaces are in "featured block twoblock boxshadow", but that XPath will only return a result if the attribute value is exactly equal to the string you give it. With that in mind, try something more flexible like contains(), i.e. //li[contains(@class, 'featured block')].
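
To illustrate the difference between the exact comparison and contains(), here is a minimal sketch. It uses the JDK's built-in javax.xml.xpath rather than HTMLUnit so it runs without extra dependencies, and the markup in SAMPLE is made up for the demonstration (it is not the asker's page). The class name ClassMatchDemo and the helper count() are likewise just illustrative names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ClassMatchDemo {
    // Hypothetical markup: the first <li> has a run of spaces inside its class value.
    static final String SAMPLE =
        "<ul>"
        + "<li class='featured block twoblock    boxshadow'>A</li>"
        + "<li class='featured block twoblock boxshadow'>B</li>"
        + "<li class='plain'>C</li>"
        + "</ul>";

    // Counts how many nodes the given XPath expression selects in SAMPLE.
    static int count(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(SAMPLE)));
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            return hits.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Exact comparison: only the <li> whose class matches character for character.
        System.out.println(count("//li[@class='featured block twoblock boxshadow']"));
        // Substring test: both "featured block ..." items, whatever the later spacing.
        System.out.println(count("//li[contains(@class, 'featured block')]"));
    }
}
```

With this sample, the exact comparison selects one item and contains() selects two. Keep in mind that contains() is a plain substring test, so it would also match a class such as 'unfeatured block'.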

Without seeing what source you're targeting I can't suggest much more, but will update the answer when it's added to the question.

I've tried your XPath (just the /div part, since that's what was provided) on the given snippet and got a match back as a result. It looks like the issue is with how you're retrieving the <li> company nodes.

Update 2: From your updated XML snippet, your first XPath, //li[@class='featured block twoblock boxshadow'], doesn't look like it will match the parent <li> node, because of the spacing issue I mentioned before. Secondly, even if it did, you are checking the <li> node's attributes twice in separate queries, and assuming that the data-pvd-p value (which starts at 3 in the snippet) will always match the list index (which starts at 0, with your +1 added). I'd suggest removing this portion, //li[@data-pvd-p='"+j+1+"'], and beginning the path at the div, relative to each company node.
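
As a side note on that removed portion: in Java, "'" + j + 1 + "'" is evaluated left to right as string concatenation, so it appends the digit 1 rather than adding 1 to j. The small sketch below (ConcatDemo and its two method names are just illustrative) shows the difference. Even with the parentheses, the point above still holds: nothing guarantees that data-pvd-p lines up with the list index.

```java
public class ConcatDemo {
    // Mirrors the concatenation as written in the question's XPath string.
    static String asWritten(int j) {
        return "//li[@data-pvd-p='" + j + 1 + "']";
    }

    // Parenthesised so the addition happens before the concatenation.
    static String parenthesised(int j) {
        return "//li[@data-pvd-p='" + (j + 1) + "']";
    }

    public static void main(String[] args) {
        System.out.println(asWritten(2));      // //li[@data-pvd-p='21']
        System.out.println(parenthesised(2));  // //li[@data-pvd-p='3']
    }
}
```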

So something like this:

List<DomNode> companies = (List<DomNode>) page.getByXPath("//li[contains(@class, 'featured block')]");
        for (DomNode node : companies) {

            // retrieve telephone number
            DomNode telephone = (DomNode) node.getByXPath(
                    "div[@class='listingWrapper']/div[@class='itemInfo']/span[@class='tel']").get(0);
        }

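The reason for starting the path at div rather than // is that, in XPath 1.0, a path beginning with // is evaluated from the document root even when you call it on a specific node. Here is a rough sketch of that behaviour using the JDK's javax.xml.xpath instead of HTMLUnit (I'm assuming HTMLUnit's getByXPath follows the same rule). The markup in SAMPLE and the names RelativePathDemo and telFor() are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class RelativePathDemo {
    // Hypothetical listing markup with two companies; not the asker's real page.
    static final String SAMPLE =
        "<ul>"
        + "<li class='featured block'><div class='listingWrapper'><div class='itemInfo'>"
        + "<span class='tel'>111</span></div></div></li>"
        + "<li class='featured block'><div class='listingWrapper'><div class='itemInfo'>"
        + "<span class='tel'>222</span></div></div></li>"
        + "</ul>";

    // Evaluates the expression with the index-th company <li> as the context node
    // and returns the string value of the first node it selects.
    static String telFor(int index, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(SAMPLE)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList companies = (NodeList) xpath.evaluate(
                "//li[contains(@class, 'featured block')]", doc, XPathConstants.NODESET);
            return xpath.evaluate(expression, companies.item(index));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String relative = "div[@class='listingWrapper']/div[@class='itemInfo']/span[@class='tel']";
        // Absolute path: ignores the context node and finds the first tel in the document.
        System.out.println(telFor(1, "//span[@class='tel']"));
        // Relative path: stays inside the second company node.
        System.out.println(telFor(1, relative));
        // Descendant search scoped to the context node via the leading dot.
        System.out.println(telFor(1, ".//span[@class='tel']"));
    }
}
```

For the second company, the absolute path yields the first company's number (111), while both the relative path and the .// form yield 222. The .// form is a handy alternative when you don't want to spell out every intermediate div.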