Reputation: 2418
I have a div element that contains several child elements. I want to scrape only the text from all of the div's child elements. Is there a built-in function or Scrapy property for that?
Example: I need to scrape the breadcrumb from http://www.jabong.com/z-collection-Olive-Mocassins-376735.html
ID of the div to scrape content from: breadcrumbs
Desired output: Home > Men > Shoes > Casual Shoes > Moccasins > Olive Mocassins
Upvotes: 0
Views: 1675
Reputation: 20748
You can use an HtmlXPathSelector and an XPath expression selecting all descendant text nodes of the div with ID "breadcrumbs", such as id("breadcrumbs")//text().
To illustrate that, I'll use the scrapy shell command, which gives you an HtmlXPathSelector instance, hxs:
paul@wheezy:~$ scrapy shell http://www.jabong.com/z-collection-Olive-Mocassins-376735.html
...
2013-10-15 09:30:06+0200 [default] DEBUG: Crawled (200) <GET http://www.jabong.com/z-collection-Olive-Mocassins-376735.html> (referer: None)
[s] Available Scrapy objects:
[s] hxs <HtmlXPathSelector xpath=None data=u'<html><head><meta http-equiv="Content-Ty'>
...
In [1]: hxs.select('id("breadcrumbs")//text()').extract()
Out[1]:
[u'\r\n ',
u'Home',
u'\r\n ',
u'>',
u'\r\n ',
u'Men',
u'\r\n ',
u'>',
u'\r\n ',
u'Shoes',
u'\r\n ',
u'>',
u'\r\n ',
u'Casual Shoes',
u'\r\n ',
u'>',
u'\r\n ',
u'Moccasins',
u'\r\n ',
u'>',
u'\r\n ',
u'Olive Mocassins',
u'\r\n \r\n',
u'\r\n ',
u'\r\n\r\n ']
If you need to strip those whitespace characters, you can use map() with unicode.strip:
In [2]: map(unicode.strip, hxs.select('id("breadcrumbs")//text()').extract())
Out[2]:
[u'',
u'Home',
u'',
u'>',
u'',
u'Men',
u'',
u'>',
u'',
u'Shoes',
u'',
u'>',
u'',
u'Casual Shoes',
u'',
u'>',
u'',
u'Moccasins',
u'',
u'>',
u'',
u'Olive Mocassins',
u'',
u'',
u'']
You can remove those empty strings using filter():
In [4]: filter(bool, map(unicode.strip, hxs.select('id("breadcrumbs")//text()').extract()))
Out[4]:
[u'Home',
u'>',
u'Men',
u'>',
u'Shoes',
u'>',
u'Casual Shoes',
u'>',
u'Moccasins',
u'>',
u'Olive Mocassins']
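The session above runs on Python 2, where the unicode type exists and map() and filter() return lists. As a minimal sketch of the same two steps on Python 3, assuming the text nodes are already extracted into a plain list (the shortened raw_nodes sample below is made up for illustration), list comprehensions do the stripping and filtering:

```python
# Shortened, hypothetical sample of extracted text nodes,
# shaped like the Out[1] list above.
raw_nodes = ["\r\n ", "Home", "\r\n ", ">", "\r\n ", "Men", "\r\n \r\n"]

# Step 1: strip surrounding whitespace from every node
# (Python 3 has no unicode type; str.strip does the same job).
stripped = [text.strip() for text in raw_nodes]

# Step 2: drop the nodes that became empty strings.
cleaned = [text for text in stripped if text]

print(cleaned)
```

Comprehensions avoid the Python 3 surprise that map() and filter() return lazy iterators rather than lists.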
Here's a one-liner to get the breadcrumbs as a single string, using str.join() and map() again:
In [9]: ' '.join(map(unicode.strip, hxs.select('id("breadcrumbs")//text()').extract())).strip()
Out[9]: u'Home  >  Men  >  Shoes  >  Casual Shoes  >  Moccasins  >  Olive Mocassins'
or even, dropping the empty strings before joining:
In [10]: ' '.join(filter(bool, map(unicode.strip, hxs.select('id("breadcrumbs")//text()').extract())))
Out[10]: u'Home > Men > Shoes > Casual Shoes > Moccasins > Olive Mocassins'
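For readers without a Scrapy shell at hand, the idea of "all descendant text nodes, stripped, filtered, and joined" can be sketched with only the Python 3 standard library. This is an illustration under assumptions, not the Scrapy API: the SAMPLE markup and the breadcrumb_text helper are hypothetical, and xml.etree.ElementTree only parses well-formed markup and supports a subset of XPath, so real-world HTML generally still needs Scrapy's selectors.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page's breadcrumb markup.
SAMPLE = """<div id="breadcrumbs">
  <a href="/">Home</a>
  <span>&gt;</span>
  <a href="/men/">Men</a>
  <span>&gt;</span>
  <strong>Olive Mocassins</strong>
</div>"""


def breadcrumb_text(markup, element_id="breadcrumbs"):
    """Join the non-empty, stripped text nodes under the element with the given id."""
    root = ET.fromstring(markup)
    # find() only searches descendants, so check the root element itself first.
    if root.get("id") == element_id:
        node = root
    else:
        node = root.find(".//*[@id='%s']" % element_id)
    if node is None:
        return ""
    # itertext() yields every descendant text node in document order,
    # much like the //text() step in the XPath expression above.
    parts = (text.strip() for text in node.itertext())
    return " ".join(part for part in parts if part)


print(breadcrumb_text(SAMPLE))
```

Filtering the empty strings before joining is what keeps the separators down to a single space.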
Upvotes: 4