Raouf Yahiaoui

Reputation: 71

Extract text from a site using Scrapy spider

I am trying to extract the description of a book from the Amazon site using a Scrapy spider. This is the link to the book: https://www.amazon.com/Local-Woman-Missing-Mary-Kubica/dp/1665068671

This is the div that contains the description text:

<div aria-expanded="true" class="a-expander-content a-expander-partial-collapse-content a-expander-content-expanded" style="padding-bottom: 20px;">
  <p><span class="a-text-bold">MP3 CD Format</span></p>
  <p><span class="a-text-bold">People don’t just disappear without a trace…</span></p>
  <p class="a-text-bold"><span class="a-text-bold">Shelby Tebow is the first to go missing. Not long after, Meredith Dickey and her six-year-old daughter, Delilah, vanish just blocks away from where Shelby was last seen, striking fear into their once-peaceful community. Are these incidents connected? After an elusive search that yields more questions than answers, the case eventually goes cold.</span></p>
  <p class="a-text-bold"><span class="a-text-bold">Now, eleven years later, Delilah shockingly returns. Everyone wants to know what happened to her, but no one is prepared for what they’ll find…</span></p>
  <p class="a-text-bold"><span class="a-text-bold">In this smart and chilling thriller, master of suspense and New York Times bestselling author Mary Kubica takes domestic secrets to a whole new level, showing that some people will stop at nothing to keep the truth buried.</span></p>
  <p></p>
</div>

I tried these lines:

div = response.css(".a-expander-content.a-expander-partial-collapse-content.a-expander-content-expanded")
description = " ".join([re.sub('<.*?>', '', span) for span in response.css('.a-expander-content span').extract()])

It's not working as expected. If you have any ideas, please share them here. Thanks in advance.

Upvotes: 0

Views: 66

Answers (1)

Ikram Khan Niazi

Reputation: 805

Here is the Scrapy code:

import scrapy


class AmazonSpider(scrapy.Spider):
    name = 'amazon'
    start_urls = ['https://www.amazon.com/dp/1665068671']

    def start_requests(self):
        # Send the book page to parse_book instead of the default parse().
        yield scrapy.Request(self.start_urls[0], callback=self.parse_book)

    def parse_book(self, response):
        # Collect every text node under the description expander and concatenate them.
        description = "".join(response.css('[data-a-expander-name="book_description_expander"] .a-expander-content ::text').getall())
        yield {"description": description}

Output:

{'description': ' MP3 CD FormatPeople don’t just disappear without a trace…Shelby Tebow is the first to go missing. Not long after, Meredith Dickey and her six-year-old daughter, Delilah, vanish just blocks away from where Shelby was last seen, striking fear into their once-peaceful community. Are these incidents connected? After an elusive search that yields more questions than answers, the case eventually goes cold.Now, eleven years later, Delilah shockingly returns. Everyone wants to know what happened to her, but no one is prepared for what they’ll find…In this smart and chilling thriller, master of suspense and New York Times bestselling author Mary Kubica takes domestic secrets to a whole new level, showing that some people will stop at nothing to keep the truth buried.  '}
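
Note that the output above runs the paragraphs together ("MP3 CD FormatPeople ...") because the text nodes are joined with an empty string. If you want spaces between paragraphs, something like the following might help. It is only a minimal sketch: `join_text_nodes` is a hypothetical helper name, and the sample list just imitates the kind of list that `.getall()` returns, not the real page content.

```python
import re


def join_text_nodes(fragments):
    """Join text fragments with single spaces, skipping whitespace-only ones."""
    # Collapse internal runs of whitespace and trim each fragment.
    cleaned = (re.sub(r"\s+", " ", fragment).strip() for fragment in fragments)
    # Drop fragments that were empty or whitespace-only, then join with a space.
    return " ".join(fragment for fragment in cleaned if fragment)


# Sample data shaped like the list returned by `.getall()` (assumed, for illustration).
fragments = [" ", "MP3 CD Format", "People don’t just disappear without a trace…", "  "]
description = join_text_nodes(fragments)
print(description)
```

In the spider, you could pass the result of `response.css(...).getall()` to this helper instead of calling `"".join(...)` directly.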

Upvotes: 2
