user3264669

Reputation: 5

Scrapy and nested tags in a table - xpath help when following-sibling

I'm new to Scrapy and having some trouble extracting text from nested tags in a table.

The tutorials I found mostly still reference the deprecated HtmlXPathSelector, whereas I'm using the newer Selector import from Scrapy 0.22 (I believe HtmlXPathSelector was deprecated sometime in 2013).

Basic XPath extraction seems to work, but my attempt to extract each item fails. My use of the extract() method seems to trigger an error relating to unicode.

I would just like to return the <td> value below each <th> as an item.

-- Table Example --

<table cellspacing="0" class="contenttable company-details">
   <tr>
      <th>Item Code</th>
      <td>IT123</td>
   </tr>
   <tr>
      <th>Listing Date</th>
      <td>12 September, 1996</td>
   </tr>
   <tr>
      <th>Internet Address</th>
      <td class="altrow"><a href="http://www.website.com/" target="_top">http://www.website.com/</a></td>
   </tr>
   <tr>
      <th>Office Address</th>
      <td>123 Example Street</td>
   </tr>    
    <tr>
      <th>Office Telephone</th>
      <td>(01) 1234 5678</td>
    </tr>       
 </table>

The following link was somewhat helpful in understanding nested tags:

https://stackoverflow.com/questions/18928303/parsing-adjacent-items-in-scrapy

-- Spider example --

from scrapy.spider import Spider
from scrapy.selector import Selector
from my.items import MyItem

class MySpider(Spider):
    name = "my"
    allowed_domains = ["website.com"]
    start_urls = ["http://www.website.com/test"]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//table[@class="contenttable company-details"]//tr').extract()
        items = []
        for site in sites:
            item = MyItem()
            item['item_code'] = site.xpath('//th[text()="Item Code"]/following-sibling::td//text()').extract()
            #item['listing_date'] = site.xpath(...
            #item['web_url'] = site.xpath(...
            #item['office_address'] = site.xpath(...
            #item['office_phone'] = site.xpath(...
            items.append(item)
        return items

I've commented out the other fields until I can get this working; each one corresponds to its respective <th> tag.

If I call extract() on sites, the item assignment breaks and I get the following error:

File "/blah/../my_spider.py", line 20, in parse
    item['item_code'] = site.xpath('//th[text()="Item Code"]/following-sibling::td//text()').extract()

 exceptions.AttributeError: 'unicode' object has no attribute 'xpath'

If I remove extract() from sites, that part just returns garbage.

On their own, both XPath queries appear to work in the Scrapy debug console; combining them into something useful is the problem:

1.)   sel.xpath('//table[@class="contenttable company-details"]//tr').extract()
2.)   sel.xpath('//th[text()="Item Code"]/following-sibling::td//text()').extract()

Upvotes: 0

Views: 2728

Answers (1)

paul trmbrth

Reputation: 20748

Some explanation:

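Inside the loop, each site has to be a Selector, not an extracted string. Calling .extract() on sites turns it into a list of unicode strings, and strings have no xpath() method, hence the AttributeError. The fix is to drop the .extract() call on sites: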

    sites = sel.xpath('//table[@class="contenttable company-details"]//tr')
    items = []
    for site in sites:
        item = MyItem()
        item['item_code'] = site.xpath('.//th[text()="Item Code"]/following-sibling::td//text()').extract()
        ...

Note: inside the loop you were using an absolute XPath expression (//...), which searches the whole document, while you really need a relative XPath expression (.//...), evaluated against the currently selected element. I fixed this in the code snippet above. It could probably be simplified further to site.xpath('th[text()="Item Code"]/following-sibling::td//text()').extract(), since each th is a direct child of its tr.
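To see the "look things up relative to the current row" idea in isolation, here is a minimal sketch that runs without Scrapy. It swaps in the standard library's xml.etree.ElementTree for Scrapy's Selector, so it is not the Scrapy API and only works because this table snippet happens to be well-formed XML. The names TABLE_HTML and table_to_dict are made up for the illustration, and the markup assumes each th/td pair sits in its own tr.

```python
import xml.etree.ElementTree as ET

# Hypothetical copy of the table from the question, with every
# <th>/<td> pair wrapped in its own <tr> (an assumption).
TABLE_HTML = """
<table cellspacing="0" class="contenttable company-details">
  <tr><th>Item Code</th><td>IT123</td></tr>
  <tr><th>Listing Date</th><td>12 September, 1996</td></tr>
  <tr><th>Internet Address</th><td class="altrow"><a href="http://www.website.com/" target="_top">http://www.website.com/</a></td></tr>
  <tr><th>Office Address</th><td>123 Example Street</td></tr>
  <tr><th>Office Telephone</th><td>(01) 1234 5678</td></tr>
</table>
"""


def table_to_dict(markup):
    """Map each row's <th> label to the text of the <td> next to it."""
    root = ET.fromstring(markup)
    details = {}
    for row in root.iter("tr"):
        # "./th" and "./td" are relative paths: they only look at the
        # direct children of the current row, not at the whole table.
        th = row.find("./th")
        td = row.find("./td")
        if th is None or td is None:
            continue  # skip rows without a label/value pair
        # itertext() also collects text from nested tags such as <a>.
        label = "".join(th.itertext()).strip()
        value = "".join(td.itertext()).strip()
        details[label] = value
    return details


details = table_to_dict(TABLE_HTML)
print(details["Item Code"])
print(details["Internet Address"])
```

Each lookup starts from the current row, which is the same reason the Scrapy expressions above begin with a dot. Collecting all rows into one mapping is just one way to organise the result; in the spider you would assign the values to item fields instead.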

Upvotes: 1
