How can I extract text in a div and the hyperlink in it using scrapy?

Question

I am using Scrapy to scrape text from a website. I am a beginner with both Scrapy and XPath. Some of the 'div' tags contain some text, followed by a link, and then more text. I would like to extract both the text and the link, in the same order.

For example:


    
       
    Anonymous wrote: Daycares have been open for at least six months. Some never closed during the pandemic. They seem to be doing fine.

    Sorry, teachers. Vacation has to end sometime.

Nah, they're not doing fine.
https://coronavirus.dc.gov/page/outbreak-data

/not a teacher

I only want the following:

Nah, they're not doing fine.
https://coronavirus.dc.gov/page/outbreak-data

/not a teacher

This is what I have done so far, but it also picks up links from unwanted sections (like the 'blockquote' tag in the example above) and appends them to the end.

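import re

# Direct-child text nodes only, so text nested inside the blockquote is skipped.
post_text_response = post.xpath('.//div[@class="postbody"]/text()').getall()
# '//a' matches every descendant link, including links inside the blockquote.
post_link_attached = post.xpath('.//div[@class="postbody"]//a/@href')
post_text = re.sub(r'\s+', " ", "".join(post_text_response)).strip()
if post_link_attached:
    post_text += " " + post_link_attached.get()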

How can I achieve this in Scrapy using XPath?
Although my approach seems to work, is there a better way to ignore 'blockquote' tags when using XPath?

zx485 · Accepted Answer

You can use the following XPath-1.0 expression:

//div[@class='postbody']/br/following-sibling::text()[1] | //div[@class='postbody']/br/following-sibling::a[1]/text()

This expression outputs the text() nodes and a/text() nodes that follow any <br> element which is a child of <div class='postbody'>. The [1] predicate keeps only the first such sibling after each <br>.

Its output is

Nah, they're not doing fine.
https://coronavirus.dc.gov/page/outbreak-data 
/not a teacher
