Reputation: 1
I am using Scrapy to scrape text from a website. I am a beginner at both Scrapy and XPath. Some of the 'div' tags contain text, followed by a link, followed by more text. I would like to extract both the text and the link in the order they appear.
For example:
<div class="postbody">
<blockquote>
<div>
<cite>Anonymous wrote:</cite>Daycares have been open for at least six months. Some never closed during the pandemic. They seem to be doing fine.
<br />
<br /> Sorry, teachers. Vacation has to end sometime.
</div>
</blockquote>
<br />
<br /> Nah, they're not doing fine.
<br /> <a class="snap_shots" href="https://coronavirus.dc.gov/page/outbreak-data" target="_blank" rel="nofollow">https://coronavirus.dc.gov/page/outbreak-data</a>
<br />
<br /> /not a teacher
</div>
I only want the following:
Nah, they're not doing fine.
https://coronavirus.dc.gov/page/outbreak-data
/not a teacher
This is what I have done so far. However, it also picks up links from unwanted sections (such as the 'blockquote' tag in the example above), and it always appends the link to the end of the text instead of keeping it in place.
post_text_response = post.xpath('.//div[@class="postbody"]/text()').getall()
post_link_attached = post.xpath('.//div[@class="postbody"]//a/@href')
post_text = re.sub(r'\s+', " ", "".join(post_text_response)).strip()
if len(post_link_attached) > 0:
    post_text += " " + post_link_attached.extract_first()
Upvotes: 0
Views: 889
Reputation: 29022
You can use the following XPath-1.0 expression:
//div[@class='postbody']/br/following-sibling::text()[1] | //div[@class='postbody']/br/following-sibling::a[1]/text()
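For each <br> element that is a direct child of <div class="postbody">, this expression selects the first text() node that follows it as a sibling, together with the text of the first <a> sibling that follows it. The <br> elements inside the <blockquote> are not children of the postbody div, so the quoted text is skipped.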
Its output is
Nah, they're not doing fine.
https://coronavirus.dc.gov/page/outbreak-data
/not a teacher
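To try the expression outside a spider, here is a minimal sketch that runs it against the sample markup using lxml directly (the library Scrapy's selectors are built on). The names SAMPLE, XPATH and extract_post_lines are made up for this illustration. It assumes the union is returned in document order, and it filters out the whitespace-only text nodes that sit between consecutive <br> tags.

```python
import re
from lxml import html

# Sample markup copied from the question.
SAMPLE = """<div class="postbody">
<blockquote>
<div>
<cite>Anonymous wrote:</cite>Daycares have been open for at least six months. Some never closed during the pandemic. They seem to be doing fine.
<br />
<br /> Sorry, teachers. Vacation has to end sometime.
</div>
</blockquote>
<br />
<br /> Nah, they're not doing fine.
<br /> <a class="snap_shots" href="https://coronavirus.dc.gov/page/outbreak-data" target="_blank" rel="nofollow">https://coronavirus.dc.gov/page/outbreak-data</a>
<br />
<br /> /not a teacher
</div>"""

# The XPath 1.0 expression from the answer, split over two lines for readability.
XPATH = (
    "//div[@class='postbody']/br/following-sibling::text()[1]"
    " | //div[@class='postbody']/br/following-sibling::a[1]/text()"
)


def extract_post_lines(markup):
    """Return the non-empty text pieces selected by XPATH, whitespace-normalized."""
    tree = html.document_fromstring(markup)
    nodes = tree.xpath(XPATH)  # list of string-like text results
    # Collapse runs of whitespace and trim each piece.
    lines = [re.sub(r"\s+", " ", str(node)).strip() for node in nodes]
    # Drop the whitespace-only nodes found between adjacent <br> tags.
    return [line for line in lines if line]


if __name__ == "__main__":
    for line in extract_post_lines(SAMPLE):
        print(line)
```

Inside a Scrapy callback the same expression should work with post.xpath(...).getall(), followed by the same whitespace cleanup. Prefix both branches with a dot (.//div[...]) if the match should be limited to the current post selector.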
Upvotes: 1