Rodwan Bakkar

Reputation: 484

XPath to extract all the text in a specific node and return it as one element using Scrapy

I have this HTML:

<html>
<p>
   This is my first sentence
   <br>
   This sentance should be considered as part of the first one.
   <br>
   And this also
</p>
<p>
   This is the second sentence
</p>
</html>

I want to extract the text from the p nodes so that all the text inside one node is returned as a single element. I am using the scrapy shell like this:

scrapy shell path/to/file.html
response.xpath('//p/text()').extract()

The output I get is:

[
'This is my first sentence',
'This sentance should be considered as part of the first one.',
'And this also',
'This is the second sentence'
]

The output I want:

[
 'This is my first sentence This sentance should be considered as part of the first one And this also',
 'This is the second sentence'
]

How can I solve this using an XPath expression?

Thank you very much :))))

Upvotes: 0

Views: 606

Answers (2)

Gallaecio

Reputation: 3847

Alternatively, you could avoid w3lib by using ' '.join(), as suggested in the comments:

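# Select each <p>, then join its own text nodes, dropping the surrounding whitespace
paragraphs = response.css('p')
paragraphs = [' '.join(t.strip() for t in p.xpath('./text()').getall() if t.strip()) for p in paragraphs]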

Upvotes: 1

Rodwan Bakkar

Reputation: 484

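This solved the issue:

from w3lib.html import remove_tags
# Extract each <p> as an HTML string, strip its tags, then collapse leftover whitespace
two_texts = response.xpath('//p').extract()
two_texts = [' '.join(remove_tags(text).split()) for text in two_texts]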

Upvotes: 1
