Reputation: 198
I am using Scrapy to scrape a website that has a structure similar to the following:
<table>
<td>
<p>Some text</p>
</td>
<td>
<p>
<strong>More Text</strong>
<br />Another Text
</p>
</td>
...
</table>
I am able to scrape all the text inside the different <p> tags
with something like response.xpath('//p//text()').extract().
The problem is that this splits the text nodes inside the same <p> tag into separate items in the result:
'text': ['Some text', 'More Text', 'Another Text']
And ideally I would need it like this:
'text': ['Some text', 'More Text Another Text']
Is it possible to get the result like that?
Upvotes: 1
Views: 2782
Reputation: 10666
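Another way is to use the XPath string() function (you may need to strip() the result afterwards).
Note that string() applied to a node-set only converts the first node, so call it once per <p>:
text = [p.xpath('string(.)').extract_first() for p in response.xpath('//p')]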
Upvotes: 0
Reputation: 3146
In these cases I do the following trick:
text = [
    ' '.join(
        line.strip()
        for line in p.xpath('.//text()').extract()
        if line.strip()
    )
    for p in response.xpath('//p')
]
This will give you exactly what you want.
Upvotes: 3