silentcobra

Reputation: 1

How can I extract text in a div and the hyperlink in it using scrapy?

I am using scrapy to scrape text from a website. I am a beginner with both scrapy and XPath. Some of the 'div' tags contain text, followed by a link, followed by more text. I would like to extract both the text and the link, in the same order in which they appear.

For example:

<div class="postbody">
    <blockquote>
       <div>
          <cite>Anonymous wrote:</cite>Daycares have been open for at least six months. Some never closed during the pandemic. They seem to be doing fine. 
          <br /> 
          <br /> Sorry, teachers. Vacation has to end sometime. 
        </div>
    </blockquote>
  <br /> 
  <br /> Nah, they're not doing fine.
  <br /> <a class="snap_shots" href="https://coronavirus.dc.gov/page/outbreak-data" target="_blank" rel="nofollow">https://coronavirus.dc.gov/page/outbreak-data</a>
  <br /> 
  <br /> /not a teacher
</div>

I only want the following:

Nah, they're not doing fine.
https://coronavirus.dc.gov/page/outbreak-data

/not a teacher

This is what I have done so far. However, it also picks up links from unwanted sections (such as the 'blockquote' tag in the example above) and appends the link to the end of the text.

import re

# Direct text children of the post body (text nested in the blockquote is skipped)
post_text_response = post.xpath('.//div[@class="postbody"]/text()').getall()
# Every link anywhere under the post body, including links inside blockquotes
post_link_attached = post.xpath('.//div[@class="postbody"]//a/@href')
post_text = re.sub(r'\s+', " ", "".join(post_text_response)).strip()
if post_link_attached:
    post_text += " " + post_link_attached.get()

  1. How can I achieve this in scrapy using XPath?
  2. Although my approach seems to work, is there a better way to ignore 'blockquote' tags while using XPath?

Upvotes: 0

Views: 889

Answers (1)

zx485

Reputation: 29022

You can use the following XPath-1.0 expression:

//div[@class='postbody']/br/following-sibling::text()[1] | //div[@class='postbody']/br/following-sibling::a[1]/text()

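For each <br> element that is a child of <div class="postbody">, this expression selects the first text() node that follows it as a sibling, plus the text of the first <a> sibling that follows it. The <br> elements inside the <blockquote> are not children of that <div>, so the quoted text is skipped.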


Its output is

Nah, they're not doing fine.
https://coronavirus.dc.gov/page/outbreak-data 
/not a teacher

Upvotes: 1
