Reputation: 5
I'm new to scrapy and having some trouble extracting text from nested tags in a table.
The tutorials I found mostly still reference the deprecated HtmlXPathSelector, but I'm using the newer Selector import from Scrapy 0.22 (I believe HtmlXPathSelector was deprecated sometime in 2013).
The basic XPath extraction seems to work, but my attempt to extract each item fails. My use of the extract() method seems to generate an error relating to unicode.
I would just like to return each of the <td> values below as an item.
-- Table Example --
<table cellspacing="0" class="contenttable company-details">
<tr>
<th>Item Code</th>
<td>IT123</td>
</tr>
<tr>
<th>Listing Date</th>
<td>12 September, 1996</td>
</tr>
<tr>
<th>Internet Address</th>
<td class="altrow"><a href="http://www.website.com/" target="_top">http://www.website.com/</a></td>
</tr>
<tr>
<th>Office Address</th>
<td>123 Example Street</td>
</tr>
<tr>
<th>Office Telephone</th>
<td>(01) 1234 5678</td>
</tr>
</table>
The following link was somewhat helpful in understanding nested tags:
https://stackoverflow.com/questions/18928303/parsing-adjacent-items-in-scrapy
-- Spider example --
from scrapy.spider import Spider
from scrapy.selector import Selector
from my.items import MyItem

class MySpider(Spider):
    name = "my"
    allowed_domains = ["website.com"]
    start_urls = ["http://www.website.com/test"]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//table[@class="contenttable company-details"]//tr').extract()
        items = []
        for site in sites:
            item = MyItem()
            item['item_code'] = site.xpath('//th[text()="Item Code"]/following-sibling::td//text()').extract()
            #item['listing_date'] = site.xpath(...
            #item['web_url'] = site.xpath(...
            #item['office_address'] = site.xpath(...
            #item['office_phone'] = site.xpath(...
            items.append(item)
        return items
I've commented out the other fields until I can get this working; they relate to the respective <th> tags.
If I call extract() on sites, the loop breaks and I get the following error:
File "/blah/../my_spider.py", line 20, in parse
    item['item_code'] = site.xpath('//th[text()="Item Code"]/following-sibling::td//text()').extract()
exceptions.AttributeError: 'unicode' object has no attribute 'xpath'
If I remove extract() from sites, then that part just returns garbage.
On their own, the XPath queries appear to work in the Scrapy debug console; combining them into something useful is the problem:
1.) sel.xpath('//table[@class="contenttable company-details"]//tr').extract()
2.) sel.xpath('//th[text()="Item Code"]/following-sibling::td//text()').extract()
Upvotes: 0
Views: 2728
Reputation: 20748
Some explanation:
sel.xpath('//table[@class="contenttable company-details"]//tr')
returns a SelectorList, so
sel.xpath('//table[@class="contenttable company-details"]//tr').extract()
returns a list of unicode strings (the selected elements serialized, here as HTML snippets).
What you need is to loop over each site in sites as a Selector, so the fix is to drop the .extract() call:
sites = sel.xpath('//table[@class="contenttable company-details"]//tr')
items = []
for site in sites:
    item = MyItem()
    item['item_code'] = site.xpath('.//th[text()="Item Code"]/following-sibling::td//text()').extract()
    ...
Note: in the loop you were using absolute XPath expressions (//...), while you really need relative XPath expressions (.//...), that is, relative to the currently selected element. I fixed that in the code snippet above. It could probably be simplified further to site.xpath('th[text()="Item Code"]/following-sibling::td//text()').extract()
Upvotes: 1