vivekanon

Reputation: 1823

Why is XPath selecting only the last <li> inside the <ul>?

I'm trying to scrape this site: http://www.kaymu.com.ng/.

The part of the HTML I'm scraping is like this:

<ul id="navigation-menu">
    <li> some content </li>
    <li> some content </li>
    ...
    <li> some content </li>
</ul>

This is my spider:

from scrapy import Spider


class KaymuSpider(Spider):
    name = "kaymu"
    allowed_domains = ["kaymu.com.ng"]
    start_urls = [
        "http://www.kaymu.com.ng"
    ]

    def parse(self, response):
        sel = response.selector
        menu = sel.xpath('//ul[@id="navigation-menu"]/li')

menu ends up containing only the last li element of the list. I'm not sure why, since the XPath looks right for selecting all the li elements. Please suggest what might be wrong, thanks!

Upvotes: 2

Views: 791

Answers (1)

alecxe

Reputation: 473873

The problem is that the menu is built dynamically by JavaScript running in the browser. Scrapy is not a browser and has no built-in JavaScript engine, so it only sees the HTML as the server delivered it.

Fortunately, the page contains a script tag holding a JavaScript array of menu objects. We can locate that script tag, extract the array with a regular expression, load it into a Python list with the json module, and print the menu item names.

Demo from the "Scrapy Shell":

$ scrapy shell http://www.kaymu.com.ng/

In [1]: script = response.xpath("//script[contains(., 'categoryData')]/text()").extract()[0]

In [2]: import re

In [3]: pattern = re.compile(r'var categoryData = (.*?);\n')

In [4]: data = pattern.search(script).group(1)

In [5]: import json

In [6]: data = json.loads(data)

In [7]: for item in data:
   ....:     print(item['name'])
   ....:     
Fashion
Jewelry & Watches
Health & Beauty
Sporting Goods
Mobile Phones & Tablets
Audio, Video & Gaming
Computers, Laptops & Accessories
Appliances, Furniture & Decor
Books & Media
Babies & Kids
Food & Beverages
Other
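
To reuse the same steps inside the spider, they could be wrapped in a small helper. This is only a sketch: the names extract_category_names and CATEGORY_DATA_RE and the sample script text are made up for illustration, and it assumes the page still declares categoryData on a single line ending in ";" followed by a newline, as in the shell session above.

```python
import json
import re

# Same pattern as in the shell session: capture everything between
# "var categoryData = " and the first ";" that is followed by a newline.
CATEGORY_DATA_RE = re.compile(r'var categoryData = (.*?);\n')


def extract_category_names(script_text):
    """Return the 'name' of each entry in the categoryData array, or [] if it is absent."""
    match = CATEGORY_DATA_RE.search(script_text)
    if match is None:
        return []
    return [item['name'] for item in json.loads(match.group(1))]


# Hypothetical script content, invented for illustration only.
sample_script = 'var categoryData = [{"name": "Fashion"}, {"name": "Other"}];\nvar x = 1;\n'
print(extract_category_names(sample_script))
```

In parse, the helper would receive the text selected by response.xpath("//script[contains(., 'categoryData')]/text()"). Since .extract() returns a list, it is worth checking that the list is not empty before indexing it, in case the site changes its markup.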

Upvotes: 2
