Scrapy and nested tags in a table - xpath help when following-sibling

Question

I'm new to scrapy and having some trouble extracting text from nested tags in a table.

The example tutorials I found mostly still reference the deprecated HtmlXPathSelector, but I'm using the newer Selector class from Scrapy 0.22 (I believe HtmlXPathSelector was deprecated sometime in 2013).

The basic XPath extraction seems to work, but my attempt to extract each item fails. My use of the extract() method seems to generate an error relating to unicode.

I would just like to return each of the <td> values below as an item field.

-- Table Example --

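<table cellspacing="0" class="contenttable company-details">
   <tr>
      <th>Item Code</th>
      <td>IT123</td>
   </tr>
   <tr>
      <th>Listing Date</th>
      <td>12 September, 1996</td>
   </tr>
   <tr>
      <th>Internet Address</th>
      <td>http://www.website.com/</td>
   </tr>
   <tr>
      <th>Office Address</th>
      <td>123 Example Street</td>
   </tr>
   <tr>
      <th>Office Telephone</th>
      <td>(01) 1234 5678</td>
   </tr>
</table>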

The following link was somewhat helpful in understanding nested tags:

https://stackoverflow.com/questions/18928303/parsing-adjacent-items-in-scrapy

-- Spider example --

from scrapy.spider import Spider
from scrapy.selector import Selector
from my.items import MyItem

class MySpider(Spider):
   name = "my"
   allowed_domains = ["website.com"]
   start_urls = ["http://www.website.com/test" ]

   def parse(self, response):
      sel = Selector(response)
      sites = sel.xpath('//table[@class="contenttable company-details"]//tr').extract()
      items = []
      for site in sites:
         item = MyItem()
         item['item_code'] = site.xpath('//th[text()="Item Code"]/following-sibling::td//text()').extract()
         #item['listing_date'] = site.xpath(...
         #item['web_url'] = site.xpath(...
         #item['office_address'] = site.xpath(...
         #item['office_phone'] = site.xpath(...
         items.append(item)
      return items

I've commented out some of the other items until I can get this working; they relate to their respective <th> tags.

If I call extract() on sites, it seems to break the item lookups and I get the following error:

File "/blah/../my_spider.py", line 20, in parse
    item['item_code'] = site.xpath('//th[text()="Item Code"]/following-sibling::td//text()').extract()

 exceptions.AttributeError: 'unicode' object has no attribute 'xpath'

If I remove extract() from sites, then that part just returns garbage...

On their own, the XPath queries appear to work fine in the Scrapy debug console; combining them into something useful is the problem:

1.)   sel.xpath('//table[@class="contenttable company-details"]//tr').extract()
2.)   sel.xpath('//th[text()="Item Code"]/following-sibling::td//text()').extract()

paul trmbrth · Accepted Answer

Some explanation:

sel.xpath('//table[@class="contenttable company-details"]//tr') returns a SelectorList, so sel.xpath('//table[@class="contenttable company-details"]//tr').extract() returns a list of unicode strings (the selected elements serialized, here as HTML snippets). A unicode string has no xpath() method, hence the AttributeError.

What you need is to loop over each site in sites as a Selector, so the fix is to drop the .extract() call:

    sites = sel.xpath('//table[@class="contenttable company-details"]//tr')
    items = []
    for site in sites:
        item = MyItem()
        item['item_code'] = site.xpath('.//th[text()="Item Code"]/following-sibling::td//text()').extract()
        ...

Note: in the loop you were using absolute XPath expressions (//...), which search the whole document, whereas you need relative XPath expressions (.//...), which are evaluated relative to the currently selected element. I fixed this in the code snippet above. (It could probably be simplified further to site.xpath('th[text()="Item Code"]/following-sibling::td//text()').extract().)
