How to extract only text from the div containing more divs using scrapy

Question

I have a div element that contains more child elements. I want to scrape only the text from all the child elements of that div. Is there a built-in function or Scrapy property for that?

Example: I need to scrape the breadcrumb from http://www.jabong.com/z-collection-Olive-Mocassins-376735.html

ID of the div to scrape content from: breadcrumbs

Desired output: Home > Men > Shoes > Casual Shoes > Moccasins > Olive Mocassins

paul trmbrth · Accepted Answer

You can use an HtmlXPathSelector with an XPath expression that selects all descendant text nodes of the div with ID "breadcrumbs", such as id("breadcrumbs")//text().
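
To make "all descendant text nodes" concrete without needing Scrapy installed, here is a minimal sketch using only the standard library. It is a swapped-in technique, not the selector from this answer: xml.etree.ElementTree supports just a small XPath subset and needs well-formed markup, and its itertext() method plays the role of //text(). The SNIPPET markup is a made-up, simplified stand-in for the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the page's breadcrumb markup.
SNIPPET = """
<html><body>
  <div id="breadcrumbs">
    <a href="/">Home</a> <span>&gt;</span>
    <div><a href="/men/">Men</a></div> <span>&gt;</span>
    <div><b>Shoes</b></div>
  </div>
  <div id="other">ignored</div>
</body></html>
"""

root = ET.fromstring(SNIPPET)

# ElementTree's limited XPath: locate the div by its id attribute.
crumbs = root.find(".//div[@id='breadcrumbs']")

# itertext() yields every descendant text node in document order,
# including whitespace-only nodes between the tags.
text_nodes = list(crumbs.itertext())

# Keep only the nodes that still contain something after stripping.
words = [t.strip() for t in text_nodes if t.strip()]
print(words)
```

As with the XPath expression, text nested at any depth inside the div is picked up, and text outside it is not.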

To illustrate that, I'll use the scrapy shell command, which gives you an HtmlXPathSelector instance, hxs:

paul@wheezy:~$ scrapy shell http://www.jabong.com/z-collection-Olive-Mocassins-376735.html
...
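2013-10-15 09:30:06+0200 [default] DEBUG: Crawled (200) <GET http://www.jabong.com/z-collection-Olive-Mocassins-376735.html> (referer: None)
[s] Available Scrapy objects:
[s]   hxs        <HtmlXPathSelector ...>
...

In [1]: hxs.select('id("breadcrumbs")//text()').extract()
Out[1]: 
[u'\n                                                                                                ',
 u'Home',
 u'\n                                                ',
 u'>',
 u'\n                                                                                                ',
 u'Men',
 u'\n                                                ',
 u'>',
 u'\n                                                                                                ',
 u'Shoes',
 u'\n                                                ',
 u'>',
 u'\n                                                                                                ',
 u'Casual Shoes',
 u'\n                                                ',
 u'>',
 u'\n                                                                                                ',
 u'Moccasins',
 u'\n                                                ',
 u'>',
 u'\n                                                                                                ',
 u'Olive Mocassins',
 u'\n                                                         \n',
 u'\n        ',
 u'\n\n        ']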

If you need to strip those whitespace characters, you can use map() with unicode.strip:

In [2]: map(unicode.strip, hxs.select('id("breadcrumbs")//text()').extract())
Out[2]: 
[u'',
 u'Home',
 u'',
 u'>',
 u'',
 u'Men',
 u'',
 u'>',
 u'',
 u'Shoes',
 u'',
 u'>',
 u'',
 u'Casual Shoes',
 u'',
 u'>',
 u'',
 u'Moccasins',
 u'',
 u'>',
 u'',
 u'Olive Mocassins',
 u'',
 u'',
 u'']

In [3]:

You can remove those empty strings using filter():

In [4]: filter(bool, map(unicode.strip, hxs.select('id("breadcrumbs")//text()').extract()))
Out[4]: 
[u'Home',
 u'>',
 u'Men',
 u'>',
 u'Shoes',
 u'>',
 u'Casual Shoes',
 u'>',
 u'Moccasins',
 u'>',
 u'Olive Mocassins']

In [5]:
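
The session above is Python 2, where unicode.strip exists and map() and filter() return lists. On Python 3 the same two steps might look like the sketch below; raw_nodes is a made-up sample with the same shape as the extract() result, not real output from the page.

```python
# Illustrative input only: same shape as the extracted text nodes,
# with arbitrary whitespace.
raw_nodes = ['\n    ', 'Home', '\n  ', '>', '\n    ', 'Men', '\n\n  ']

# Python 3 has no separate unicode type, so str.strip takes the place
# of unicode.strip. map() returns an iterator, hence the list() call.
stripped = list(map(str.strip, raw_nodes))

# filter(None, ...) keeps only truthy items, which drops the empty strings.
non_empty = list(filter(None, stripped))

# The same result as a single list comprehension.
also_non_empty = [s.strip() for s in raw_nodes if s.strip()]

print(stripped)
print(non_empty)
```

The comprehension form is often easier to read than nested map() and filter() calls, and it behaves the same on Python 2 and 3.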

Here's a one-liner to get breadcrumbs as a single string, using str.join() and map() again:

In [9]: ' '.join(map(unicode.strip, hxs.select('id("breadcrumbs")//text()').extract())).strip()
Out[9]: u'Home  >  Men  >  Shoes  >  Casual Shoes  >  Moccasins  >  Olive Mocassins'

or even:

In [10]: ' '.join(filter(bool, map(unicode.strip, hxs.select('id("breadcrumbs")//text()').extract())))
Out[10]: u'Home > Men > Shoes > Casual Shoes > Moccasins > Olive Mocassins'
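
For reuse on Python 3, the second one-liner could be wrapped in a small helper. This is only a sketch: join_text_nodes and the sample nodes list are hypothetical names and data, and in recent Scrapy versions the input list would more likely come from response.xpath('id("breadcrumbs")//text()').getall() than from hxs.select(...).extract().

```python
def join_text_nodes(nodes, sep=" "):
    """Strip each text node, drop the empty ones, and join the rest."""
    return sep.join(s.strip() for s in nodes if s.strip())


# Illustrative input with the same shape as the extracted text nodes.
nodes = ['\n  ', 'Home', '\n ', '>', '\n  ', 'Men', '\n']

# Joining without filtering leaves the empty strings in place, which is
# why the first one-liner produces doubled spaces around each separator.
unfiltered = ' '.join(s.strip() for s in nodes).strip()

# Filtering first gives single spaces.
breadcrumb = join_text_nodes(nodes)

print(repr(unfiltered))
print(repr(breadcrumb))
```

Passing a different sep, for example join_text_nodes(nodes, sep=""), gives the text with no spacing at all.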
