oeb

Reputation: 189

Extract content between <p> tags using XPath for web scraping

I am trying to extract jokes from a website and I need to get the jokes one by one:

<div class="oneliner"
     itemscope=""
     itemtype="http://schema.org/Article">
    <p>My girl always tells me "Life is about the little things", but I just hate when she talks about her Ex.</p>
</div>

What I have come up with so far using XPath is:

.xpath('//div[@class="oneliner"]')

With this I am able to extract the individual items, but now I want to loop over all occurrences and extract the text (everything between the <p> tags). For this I tried:

for joke in jokes:
    item['joke'] = joke.xpath('//p/text()').extract()

But this gives me all jokes from that page at once instead of going through one by one. Could anyone help me with this?

Upvotes: 0

Views: 125

Answers (1)

Granitosaurus

Reputation: 21446

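You can simply iterate through the joke nodes and yield an item on every iteration. The important detail is the leading dot in the inner expression: '//p' is an absolute path that always searches from the document root, even when called on a sub-selector, so it returns every <p> on the page. './/p' is relative to the current node, so it only matches the <p> elements inside that particular div: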

def parse(self, response):
    jokes = response.xpath('//div[@class="oneliner"]')
    for joke in jokes:
        item = dict()
        item['joke'] = joke.xpath('.//p/text()').extract()
        yield item
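
If you want to see the root-versus-relative difference without running a spider, here is a minimal sketch using only the standard library's xml.etree.ElementTree instead of Scrapy's selectors. This is an assumption-laden stand-in: ElementTree only supports a subset of XPath and needs well-formed markup, and the sample page, the PAGE constant and the jokes_one_by_one helper are made up for illustration. Searching from the root plays the role of '//p' here, while searching from each div plays the role of './/p'.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page, kept well-formed so the XML parser accepts it.
PAGE = """
<html><body>
  <div class="oneliner" itemscope="" itemtype="http://schema.org/Article">
    <p>First joke.</p>
  </div>
  <div class="oneliner" itemscope="" itemtype="http://schema.org/Article">
    <p>Second joke.</p>
  </div>
</body></html>
"""

root = ET.fromstring(PAGE)

# Searching from the document root finds every <p> on the page at once,
# which is what an absolute '//p' does in the question's loop.
all_jokes = [p.text for p in root.findall(".//p")]


def jokes_one_by_one(root):
    """Yield one item per joke container, searching relative to each div."""
    for div in root.findall(".//div[@class='oneliner']"):
        # './/p' is evaluated relative to this div only.
        yield {"joke": [p.text for p in div.findall(".//p")]}


items = list(jokes_one_by_one(root))
```

Here all_jokes holds both jokes in one list, while items holds two separate dicts with one joke each, which mirrors what the parse method above yields.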

Upvotes: 2
