Reputation: 1823
I'm trying to scrape this site: http://www.kaymu.com.ng/.
The part of the HTML I'm scraping is like this:
<ul id="navigation-menu">
<li> some content </li>
<li> some content </li>
...
<li> some content </li>
</ul>
This is my spider:
from scrapy import Spider


class KaymuSpider(Spider):
    name = "kaymu"
    allowed_domains = ["kaymu.com.ng"]
    start_urls = [
        "http://www.kaymu.com.ng"
    ]

    def parse(self, response):
        sel = response.selector
        menu = sel.xpath('//ul[@id="navigation-menu"]/li')
The menu selection contains only the last li element instead of all of them. I'm not sure why it behaves this way, since the XPath should select every li element. Please suggest what might be wrong, thanks!
Upvotes: 2
Views: 791
Reputation: 473873
The problem is that the menu is constructed dynamically by the browser executing javascript. Scrapy is not a browser and doesn't have a javascript engine built in.

Fortunately, there is a script tag containing a javascript array of menu objects. We can locate the desired script tag, extract the javascript array, load it into a Python list with the help of the json module, and print out the menu item names.
Demo from the "Scrapy Shell":
$ scrapy shell http://www.kaymu.com.ng/
In [1]: script = response.xpath("//script[contains(., 'categoryData')]/text()").extract()[0]
In [2]: import re
In [3]: pattern = re.compile(r'var categoryData = (.*?);\n')
In [4]: data = pattern.search(script).group(1)
In [5]: import json
In [6]: data = json.loads(data)
In [7]: for item in data:
....: print item['name']
....:
Fashion
Jewelry & Watches
Health & Beauty
Sporting Goods
Mobile Phones & Tablets
Audio, Video & Gaming
Computers, Laptops & Accessories
Appliances, Furniture & Decor
Books & Media
Babies & Kids
Food & Beverages
Other
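If you want this inside the spider rather than the shell, here is a minimal sketch of how the same extraction could live in your parse method. It assumes the page keeps embedding the menu in a var categoryData = [...]; assignment with a name key on each object, as seen in the shell session above; adjust the regex if the markup changes.

import json
import re

from scrapy import Spider


class KaymuSpider(Spider):
    name = "kaymu"
    allowed_domains = ["kaymu.com.ng"]
    start_urls = ["http://www.kaymu.com.ng"]

    def parse(self, response):
        # grab the script tag that defines the categoryData array
        script = response.xpath(
            "//script[contains(., 'categoryData')]/text()").extract()[0]

        # pull out the javascript array literal and parse it as JSON
        match = re.search(r'var categoryData = (.*?);\n', script)
        data = json.loads(match.group(1))

        # log each menu item name (hypothetical handling; yield items
        # or follow links here instead if that is what you need)
        for item in data:
            self.log(item['name'])

Run it with scrapy crawl kaymu and the category names should show up in the log.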
Upvotes: 2