Wiggy A.

Reputation: 496

XPath: selecting text from <b> inside a div together with the surrounding text

Basically I have html similar to this:

<div>
    <p>
        <b>1</b> Communication
    </p>
    <p>
        <b>2</b> Errors
    </p>
    ...
</div>

What I'm trying (with Scrapy) is something like:

response.xpath("//div//p//text()")

However, this returns a list such as:

[
    "1",
    "Communication",
    "2",
    "Errors"    
]

I want to have something like:

[
    "1 Communication",
    "2 Errors"
]

Any help here would be greatly appreciated. I tried to find a way to ignore the b tags, but nothing I found actually worked. I can't simply join the list items in pairs, because not every HTML page I need to parse is structured like this. I want something that ignores the b tags if they exist and still returns the text inside each p in every case. Thanks!

Upvotes: 1

Views: 116

Answers (1)

Tomáš Linhart

Reputation: 10210

If your general pattern is to ignore <b> tags, you could use w3lib to remove those tags and construct a new response from the result. Something like:

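import w3lib.html
from scrapy.http import HtmlResponse

# remove_tags strips only the tags themselves; the text inside <b> is kept.
# which_ones must be an iterable of tag names, hence the one-element tuple.
new_body = w3lib.html.remove_tags(response.text, which_ones=('b',))
# remove_tags returns a str, so HtmlResponse needs an explicit encoding.
new_response = HtmlResponse(url=response.url, body=new_body, encoding='utf-8')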

new_response now contains the original response body with the <b> tags removed (their text content is kept). You can then write your extraction logic without having to account for them.
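To illustrate the goal of getting one string per <p> regardless of whether it contains a <b>, here is a minimal standalone sketch. It swaps in the standard library's xml.etree.ElementTree for Scrapy so that it runs without any dependencies; the SAMPLE markup and the paragraph_texts helper are hypothetical names made up for this example, and the third paragraph is added only to show the case without a <b> tag.

```python
import xml.etree.ElementTree as ET

# Sample markup modeled on the question. The "..." placeholder is left out
# because ElementTree requires well-formed XML.
SAMPLE = """
<div>
    <p>
        <b>1</b> Communication
    </p>
    <p>
        <b>2</b> Errors
    </p>
    <p>No bold tag here</p>
</div>
"""


def paragraph_texts(markup):
    """Return one whitespace-normalized string per <p> element."""
    root = ET.fromstring(markup)
    texts = []
    for p in root.iter("p"):
        # itertext() yields the text of p and of all its descendants
        # (including <b>) in document order.
        raw = "".join(p.itertext())
        # split() + join collapses runs of whitespace and trims both ends.
        texts.append(" ".join(raw.split()))
    return texts


print(paragraph_texts(SAMPLE))
```

Inside a Scrapy callback, the same per-paragraph idea can be expressed directly in XPath, for example [p.xpath("normalize-space(.)").get() for p in response.xpath("//div//p")], which takes the full string value of each <p> and normalizes its whitespace.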

Upvotes: 1
