Reputation: 71
I am trying to extract the description of a book from the Amazon site using a Scrapy spider. This is the link to the book: https://www.amazon.com/Local-Woman-Missing-Mary-Kubica/dp/1665068671
This is the div that contains the description text:
<div aria-expanded="true" class="a-expander-content a-expander-partial-collapse-content
a-expander-content-expanded" style="padding-bottom: 20px;"> <p><span class="a-text-
bold">MP3 CD Format</span></p><p><span class="a-text-bold">People don’t just disappear
without a trace…</span></p><p class="a-text-bold"><span class="a-text-bold">Shelby Tebow
is the first to go missing. Not long after, Meredith Dickey and her six-year-old
daughter, Delilah, vanish just blocks away from where Shelby was last seen, striking
fear into their once-peaceful community. Are these incidents connected? After an elusive
search that yields more questions than answers, the case eventually goes cold.</span>
</p><p class="a-text-bold"><span class="a-text-bold">Now, eleven years later, Delilah
shockingly returns. Everyone wants to know what happened to her, but no one is prepared
for what they’ll find…</span></p><p class="a-text-bold"><span class="a-text-bold">In
this smart and chilling thriller, master of suspense and New York Times bestselling
author Mary Kubica takes domestic secrets to a whole new level, showing that some people
will stop at nothing to keep the truth buried.</span></p><p></p> </div>
I tried these lines:
div = response.css(".a-expander-content.a-expander-partial-collapse-content.a-expander-content-expanded")
description = " ".join([re.sub('<.*?>', '', span) for span in response.css('.a-expander-content span').extract()])
It's not working as expected. If you have any ideas, please share them here. Thanks in advance.
Upvotes: 0
Views: 66
Reputation: 805
Here is Scrapy code that scopes the selector to the book description expander and collects all of its text nodes:
import scrapy
from scrapy import Request


class AmazonSpider(scrapy.Spider):
    name = 'amazon'
    start_urls = ['https://www.amazon.com/dp/1665068671']

    def start_requests(self):
        # Send the book page to parse_book instead of the default parse callback.
        yield Request(self.start_urls[0], callback=self.parse_book)

    def parse_book(self, response):
        # Select every text node inside the book description expander and concatenate them.
        description = "".join(response.css('[data-a-expander-name="book_description_expander"] .a-expander-content ::text').getall())
        yield {"description": description}
Output:
{'description': ' MP3 CD FormatPeople don’t just disappear without a trace…Shelby Tebow is the first to go missing. Not long after, Meredith Dickey and her six-year-old daughter, Delilah, vanish just blocks away from where Shelby was last seen, striking fear into their once-peaceful community. Are these incidents connected? After an elusive search that yields more questions than answers, the case eventually goes cold.Now, eleven years later, Delilah shockingly returns. Everyone wants to know what happened to her, but no one is prepared for what they’ll find…In this smart and chilling thriller, master of suspense and New York Times bestselling author Mary Kubica takes domestic secrets to a whole new level, showing that some people will stop at nothing to keep the truth buried. '}
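Note that in the output above the paragraphs run together ("MP3 CD FormatPeople", "cold.Now") and there is a stray space at each end, because "".join concatenates the text nodes with no separator. If that matters, one possible fix is to normalize the fragments before joining them. The following is only a sketch: join_text_nodes is a hypothetical helper name, not part of Scrapy, and the sample fragments are a shortened stand-in for what .getall() would return.

```python
def join_text_nodes(fragments):
    """Join text fragments with single spaces, dropping whitespace-only ones."""
    # Collapse internal runs of whitespace (including newlines) in each fragment.
    cleaned = (" ".join(fragment.split()) for fragment in fragments)
    # Skip fragments that were empty or whitespace-only, then join with one space.
    return " ".join(piece for piece in cleaned if piece)


# Shortened stand-in for the list returned by response.css(...).getall()
fragments = [" ", "MP3 CD Format", "People don’t just disappear without a trace…", " "]
description = join_text_nodes(fragments)
print(description)
```

In parse_book, this would replace the "".join(...) call, for example description = join_text_nodes(response.css('[data-a-expander-name="book_description_expander"] .a-expander-content ::text').getall()).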
Upvotes: 2