Reputation: 484
I have this HTML:
<html>
<p>
This is my first sentence
<br>
This sentance should be considered as part of the first one.
<br>
And this also
</p>
<p>
This is the second sentence
</p>
</html>
I want to extract the text from the p nodes, and all the text inside one node should be returned as a single element. I am using scrapy shell like this:
scrapy shell path/to/file.html
response.xpath('//p/text()').extract()
The output I get is:
[
'This is my first sentence',
'This sentance should be considered as part of the first one.',
'And this also',
'This is the second sentence'
]
The output I want:
[
'This is my first sentence This sentance should be considered as part of the first one And this also',
'This is the second sentence'
]
Any help on how to solve this using an XPath expression would be appreciated.
Thank you very much :))))
Upvotes: 0
Views: 606
Reputation: 3847
Alternatively, you could have avoided w3lib by using ' '.join(), as suggested in the comments:
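paragraphs = response.css('p')
# Join the direct text nodes of each <p>, stripping the newlines around each piece
paragraphs = [' '.join(t.strip() for t in p.xpath('./text()').getall()) for p in paragraphs]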
Upvotes: 1
Reputation: 484
This solved the issue...
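from w3lib.html import remove_tags
# Extract each <p> element as raw HTML, tags included
two_texts = response.xpath('//p').extract()
# Strip the markup (<p>, <br>) and keep only the text of each paragraph
two_texts = [remove_tags(text) for text in two_texts]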
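One caveat: remove_tags only deletes the markup, so the newlines that surrounded the br tags stay in the strings. A minimal sketch of a follow-up step to collapse them is below. The helper name collapse_whitespace is hypothetical, and the input strings are hand-written stand-ins for what remove_tags would likely leave behind for the sample HTML, not captured output.

```python
def collapse_whitespace(text):
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and discards leading/trailing whitespace
    return ' '.join(text.split())

# Hypothetical stand-ins for the tag-stripped paragraphs
two_texts = [
    '\nThis is my first sentence\n\nThis sentance should be considered as part of the first one.\n\nAnd this also\n',
    '\nThis is the second sentence\n',
]

two_texts = [collapse_whitespace(text) for text in two_texts]
print(two_texts)
```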
Upvotes: 1