Dervin Thunk

Reputation: 20140

Matching multiple <p> tags in scrapy

I have something like the following html:

<div class="articleBody">
  <p>
    <strong>Text</strong> lorem ipsum... 
    <strong>lorem ipsum...</strong>
  </p>
  <p>lorem ipsum 
    <strong> lorem ipsum lorem ipsum</strong>
    lorem ipsum...lorem ipsum...lorem ipsum...lorem ipsum...
  </p>
</div>

In a more general way, I have a list of <p> tags with a few <strong> tags inside.

I would like to get the text of all the <p> tags without the <strong> markup, that is, just the plain text inside the div with class "articleBody".

What I have is

response.xpath('string(//div[@class="articleBody"]//p)').extract()

but that only returns the first <p>.

Any help would be appreciated.

Upvotes: 1

Views: 1582

Answers (1)

KorreyD

Reputation: 1294

Give this a shot:

for node in response.xpath('//div[@class="articleBody"]//p'):
    # string() is evaluated relative to each <p>, so every paragraph is returned
    print(node.xpath('string()').extract())

Then you can concatenate the strings, add them to a list, or do whatever you need, instead of just printing them like I did.
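
For example, collecting the paragraphs into a list could look roughly like the sketch below. The helper name paragraph_texts is made up for illustration, the whitespace clean-up is my own addition, and .get() assumes a reasonably recent Scrapy (on older versions .extract_first() should do the same job).

```python
def paragraph_texts(response):
    """Return one cleaned-up string per <p> inside the articleBody div.

    `response` is assumed to be a Scrapy response or selector, i.e. any
    object with an .xpath() method. The function name is hypothetical.
    """
    texts = []
    for node in response.xpath('//div[@class="articleBody"]//p'):
        # string() with no argument gives the full text of the current <p>,
        # including the text inside nested <strong> tags.
        raw = node.xpath('string()').get()
        # Collapse newlines and runs of spaces left over from the HTML layout.
        texts.append(" ".join(raw.split()))
    return texts
```

Inside a spider callback you could then join the result yourself, e.g. "\n\n".join(paragraph_texts(response)), which is more or less what string-join() would give you in XPath 2.0.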

There is also the string-join() function in XPath 2.0, but Scrapy only supports XPath 1.0, so it is not available here.

More info about string-join() and related functions: http://www.w3.org/TR/xpath-functions/#func-string-join

Upvotes: 4
