Erik van de Ven

Reputation: 4975

Scrapy, receive just the text from an unordered list, including text between other html tags

With Scrapy I'm trying to get the items of a UL list: only the text, not the HTML. I want the complete text of each li element as ONE string, including the text inside any nested tags. This is an example of the HTML:

<ul>
  <li>Stoere HUMMER of Cadillic opbergtas (rood)</li>
  <li>EHBO First Aid Rapid Response kit</li>
  <li>LifeHammer met houder</li>
  <li>Aluminium <b>Midi</b> Zaklamp<br/> met alarm inclusief 3x AAA batterij</li>
  <li>Werkhandschoenen</li>
  <li>IJskrabber</li>
  <li>Afbreekmes</li>
  <li>2x veiligheidshesje</li>
  <li>Verbandschaar</li>
  <li>Reddingsdeken</li>
  <li>Verband + pleister <span>9 x rol verband</span> diverse afmetingen Pleisters</li>
  <li>Handschoenen</li>
  <li>3 x steriele gaasjes</li>
</ul>

As you can see, a list item may contain <span>, <b> or other tags. With the XPath below I can get all the items as a Python list:

sel.xpath('//*[@id="tab_description"]/ul/li[descendant-or-self::text()]').extract()

Result:

['<li>Stoere HUMMER of Cadillic opbergtas (rood)</li>',
 '<li>EHBO First Aid Rapid Response kit</li>',
 '<li>LifeHammer met houder</li>',
 '<li>Aluminium <b>Midi</b> Zaklamp<br/> met alarm inclusief 3x AAA batterij</li>',
 '<li>Werkhandschoenen</li>',
 '<li>IJskrabber</li>',
 '<li>Afbreekmes</li>',
 '<li>2x veiligheidshesje</li>',
 '<li>Verbandschaar</li>',
 '<li>Reddingsdeken</li>',
 '<li>Verband + pleister <span>9 x rol verband</span> diverse afmetingen Pleisters</li>',
 '<li>Handschoenen</li>',
 '<li>3 x steriele gaasjes</li>',]

But as you can see, it contains all the HTML as well. I just want the text. If I try this:

sel.xpath('//*[@id="tab_description"]/ul/li/descendant-or-self::text()').extract()

The result would be this:

['Stoere HUMMER of Cadillic opbergtas (rood)',
 'EHBO First Aid Rapid Response kit',
 'LifeHammer met houder',
 'Aluminium ',
 'Midi',
 '',
 ' Zaklamp met alarm inclusief 3x AAA batterij',
 'Werkhandschoenen',
 'IJskrabber',
 'Afbreekmes',
 '2x veiligheidshesje',
 'Verbandschaar',
 'Reddingsdeken',
 'Verband + pleister ',
 '9 x rol verband',
 ' diverse afmetingen Pleisters',
 'Handschoenen',
 '3 x steriele gaasjes',]

As you can see, the text inside the <span>, <b> and similar tags (within a li tag) is returned as a separate list item, which is not correct either.

I just want the complete text between each li tag as ONE string (including the text between <b> tags and such within a li tag).

This doesn't work either, because the XPath below skips the text inside nested tags:

sel.xpath('//*[@id="tab_description"]/ul/li/text()').extract()

Can someone help me out?

Upvotes: 0

Views: 1954

Answers (1)

paul trmbrth

Reputation: 20748

You have at least 2 options.

  1. use .//text() to get every text node inside each li element, including those in nested tags, and join the individual strings
  2. use the string() function on each li (or normalize-space(), which also trims and collapses whitespace)

So you can do

[u"".join(li.xpath('.//text()').extract())
 for li in sel.xpath('//*[@id="tab_description"]/ul/li')]

or

[li.xpath('string(.)').extract()[0]
 for li in sel.xpath('//*[@id="tab_description"]/ul/li')]

Both would give you

[u'Stoere HUMMER of Cadillic opbergtas (rood)',
 u'EHBO First Aid Rapid Response kit',
 u'LifeHammer met houder',
 u'Aluminium Midi Zaklamp met alarm inclusief 3x AAA batterij',
 u'Werkhandschoenen',
 u'IJskrabber',
 u'Afbreekmes',
 u'2x veiligheidshesje',
 u'Verbandschaar',
 u'Reddingsdeken',
 u'Verband + pleister 9 x rol verband diverse afmetingen Pleisters',
 u'Handschoenen',
 u'3 x steriele gaasjes']
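
If you want to try the idea without a Scrapy shell, here is a minimal sketch that swaps in the standard library's xml.etree.ElementTree for Scrapy's selectors. That is an assumption made for illustration, and it only works because this fragment happens to be well-formed XML. The wrapping div, the helper name li_texts and the collapse_whitespace flag are hypothetical and not part of the original code. itertext() plays the role of .//text(), and the whitespace collapse roughly mimics normalize-space().

```python
import xml.etree.ElementTree as ET

# Three of the list items from the question, wrapped in a div so that the
# fragment has a single root element (the wrapper is an assumption).
HTML = """<div id="tab_description"><ul>
  <li>LifeHammer met houder</li>
  <li>Aluminium <b>Midi</b> Zaklamp<br/> met alarm inclusief 3x AAA batterij</li>
  <li>Verband + pleister <span>9 x rol verband</span> diverse afmetingen Pleisters</li>
</ul></div>"""


def li_texts(markup, collapse_whitespace=False):
    """Return one string per li, including text inside nested tags."""
    root = ET.fromstring(markup)
    texts = []
    for li in root.findall("./ul/li"):
        # itertext() yields every descendant text node in document order,
        # comparable to li.xpath('.//text()') in Scrapy.
        text = "".join(li.itertext())
        if collapse_whitespace:
            # Rough equivalent of XPath normalize-space(): trim the ends and
            # collapse internal runs of whitespace to a single space.
            text = " ".join(text.split())
        texts.append(text)
    return texts


items = li_texts(HTML)
normalized = li_texts(HTML, collapse_whitespace=True)
print(items)
```

For this sample both calls return the same strings, because the items contain no leading, trailing or repeated whitespace. On real pages with indented markup the collapsed variant is usually the one you want.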

Upvotes: 2
