Reputation: 4975
With Scrapy I'm trying to get the items of a UL list: only the text, not the HTML code. I can't quite get it working. I want the complete text of each li element as ONE string (including the text inside any tags nested within the li). This is an example of the HTML code:
<ul>
<li>Stoere HUMMER of Cadillic opbergtas (rood)</li>
<li>EHBO First Aid Rapid Response kit</li>
<li>LifeHammer met houder</li>
<li>Aluminium <b>Midi</b> Zaklamp<br/> met alarm inclusief 3x AAA batterij</li>
<li>Werkhandschoenen</li>
<li>IJskrabber</li>
<li>Afbreekmes</li>
<li>2x veiligheidshesje</li>
<li>Verbandschaar</li>
<li>Reddingsdeken</li>
<li>Verband + pleister <span>9 x rol verband</span> diverse afmetingen Pleisters</li>
<li>Handschoenen</li>
<li>3 x steriele gaasjes</li>
</ul>
As you can see, a list item can contain <span>, <b> or other tags. With the XPath below I can get all the items as a Python list:
sel.xpath('//*[@id="tab_description"]/ul/li[descendant-or-self::text()]').extract()
Result:
['<li>Stoere HUMMER of Cadillic opbergtas (rood)</li>',
'<li>EHBO First Aid Rapid Response kit</li>',
'<li>LifeHammer met houder</li>',
'<li>Aluminium <b>Midi</b> Zaklamp<br/> met alarm inclusief 3x AAA batterij</li>',
'<li>Werkhandschoenen</li>',
'<li>IJskrabber</li>',
'<li>Afbreekmes</li>',
'<li>2x veiligheidshesje</li>',
'<li>Verbandschaar</li>',
'<li>Reddingsdeken</li>',
'<li>Verband + pleister <span>9 x rol verband</span> diverse afmetingen Pleisters</li>',
'<li>Handschoenen</li>',
'<li>3 x steriele gaasjes</li>',]
But as you can see, each entry still contains the HTML markup. I only want the text. If I try this:
sel.xpath('//*[@id="tab_description"]/ul/li/descendant-or-self::text()').extract()
The result would be this:
['Stoere HUMMER of Cadillic opbergtas (rood)',
'EHBO First Aid Rapid Response kit',
'LifeHammer met houder',
'Aluminium ',
'Midi',
'',
' Zaklamp met alarm inclusief 3x AAA batterij',
'Werkhandschoenen',
'IJskrabber',
'Afbreekmes',
'2x veiligheidshesje',
'Verbandschaar',
'Reddingsdeken',
'Verband + pleister ',
'9 x rol verband',
' diverse afmetingen Pleisters',
'Handschoenen',
'3 x steriele gaasjes',]
As you can see, the text inside <span>, <b> and similar tags (within a li tag) ends up as a separate list item, which is not correct either.
I just want the complete text between each li tag as ONE string (including the text between <b> tags and such within a li tag).
This doesn't work either, because the XPath below only returns the text directly inside each li and skips the text inside nested tags:
sel.xpath('//*[@id="tab_description"]/ul/li/text()').extract()
Can someone help me out?
Upvotes: 0
Views: 1954
Reputation: 20748
You have at least 2 options:
1. Use .//text() to get the text inside tags that are in li elements, and join the individual strings.
2. Use the string() function (or normalize-space()) on each li.
So you can do
[u"".join(li.xpath('.//text()').extract())
for li in sel.xpath('//*[@id="tab_description"]/ul/li')]
or
[li.xpath('string(.)').extract()[0]
for li in sel.xpath('//*[@id="tab_description"]/ul/li')]
Both would give you
[u'Stoere HUMMER of Cadillic opbergtas (rood)',
u'EHBO First Aid Rapid Response kit',
u'LifeHammer met houder',
u'Aluminium Midi Zaklamp met alarm inclusief 3x AAA batterij',
u'Werkhandschoenen',
u'IJskrabber',
u'Afbreekmes',
u'2x veiligheidshesje',
u'Verbandschaar',
u'Reddingsdeken',
u'Verband + pleister 9 x rol verband diverse afmetingen Pleisters',
u'Handschoenen',
u'3 x steriele gaasjes']
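If you want to try the idea without a Scrapy response at hand, here is a minimal sketch using only the standard library. It is not Scrapy code: it swaps in xml.etree.ElementTree, whose itertext() walks the descendant text nodes in document order much like .//text() does. The names HTML, li_texts and normalize_space are made up for this illustration, and the sample is a trimmed copy of the list from the question (it only parses here because that snippet happens to be well-formed XML).

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the list from the question. It is well-formed,
# so the stdlib XML parser accepts it (real-world HTML often is not).
HTML = """<ul>
<li>LifeHammer met houder</li>
<li>Aluminium <b>Midi</b> Zaklamp<br/> met alarm inclusief 3x AAA batterij</li>
<li>Verband + pleister <span>9 x rol verband</span> diverse afmetingen Pleisters</li>
</ul>"""


def li_texts(markup):
    """Return one string per <li>, joining all of its descendant text nodes."""
    root = ET.fromstring(markup)
    # itertext() yields the text nodes in document order,
    # comparable to li.xpath('.//text()') in the answer above.
    return ["".join(li.itertext()) for li in root.iter("li")]


def normalize_space(text):
    """Rough stand-in for XPath normalize-space(): trim and collapse whitespace."""
    return " ".join(text.split())


items = li_texts(HTML)
for item in items:
    print(normalize_space(item))
```

The difference between the two options mostly shows up with whitespace: joining the text nodes (or string(.)) keeps newlines and repeated spaces from the source as they are, while normalize-space() trims and collapses them, which is usually what you want for scraped text.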
Upvotes: 2