Adrian

Reputation: 198

Extract all elements from within p tag scrapy

I am using Scrapy to scrape a website that has a structure similar to the following:

<table>
    <td>
        <p>Some text</p>
    </td>
    <td>
        <p>
            <strong>More Text</strong>
            <br />Another Text
        </p>
    </td>
    ...
</table>

I am able to scrape all the text inside the different <p> tags with something like //p//text().extract(). The problem is that this splits the elements inside the same tag in the result:

'text': ['Some text', 'More Text', 'Another Text']

And ideally I would need it like this:

'text': ['Some text', 'More Text Another Text']

Is it possible to get the result like that?

Upvotes: 1

Views: 2782

Answers (2)

gangabass

Reputation: 10666

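Another way is to use XPath string(). Note that string() applied to a node-set only uses the first node, so call it on each <p> in turn (you may need to strip() the results later):

text = [p.xpath('string(.)').extract_first() for p in response.xpath('//p')]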
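To see what the string value of each <p> looks like and why stripping is needed, here is a minimal stdlib-only sketch. It swaps Scrapy for xml.etree.ElementTree purely for illustration, and assumes the question's markup (without the "..." placeholder) parses as well-formed XML. The joins below are stand-ins that mimic XPath's string(.) and normalize-space(), not Scrapy calls.

```python
import xml.etree.ElementTree as ET

# The question's markup, minus the "..." placeholder so it parses as XML.
SNIPPET = """
<table>
    <td>
        <p>Some text</p>
    </td>
    <td>
        <p>
            <strong>More Text</strong>
            <br />Another Text
        </p>
    </td>
</table>
"""

root = ET.fromstring(SNIPPET)

# "".join(itertext()) mimics XPath string(.): all descendant text, concatenated,
# including the newlines and indentation between the child tags.
raw = ["".join(p.itertext()) for p in root.iter("p")]

# " ".join(s.split()) mimics XPath normalize-space(): trim both ends and
# collapse every run of whitespace into a single space.
text = [" ".join(s.split()) for s in raw]

print(text)  # ['Some text', 'More Text Another Text']
```

In Scrapy itself, evaluating normalize-space(.) on each <p> selector should give the same cleaned-up strings without a separate strip() step.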

Upvotes: 0

stasdeep

Reputation: 3146

In these cases I do the following trick:

text = [
    ' '.join(
        line.strip() 
        for line in p.xpath('.//text()').extract() 
        if line.strip()
    ) 
    for p in response.xpath('//p')
]

This will give you exactly what you want.
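The same trick can be split into a small helper so the joining logic is testable without a live response. This is only a sketch: join_text_nodes and extract_paragraphs are hypothetical names, and the fragments list is an assumption about the text nodes Scrapy would return for the second <p> in the question.

```python
def join_text_nodes(fragments):
    """Join the text nodes of one element, dropping whitespace-only pieces."""
    return " ".join(part.strip() for part in fragments if part.strip())


def extract_paragraphs(response):
    """Return one joined string per <p>.

    `response` is assumed to be a Scrapy response (or any selector
    exposing .xpath()); this function is not called in this sketch.
    """
    return [
        join_text_nodes(p.xpath(".//text()").extract())
        for p in response.xpath("//p")
    ]


# Assumed text nodes of the second <p>: indentation, the <strong> text,
# more indentation, then the text following <br />.
fragments = ["\n            ", "More Text", "\n            ", "Another Text\n        "]
print(join_text_nodes(fragments))  # More Text Another Text
```

Inside a spider callback this would be used as 'text': extract_paragraphs(response).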

Upvotes: 3
