Reputation: 13
I am trying to scrape the text of Amazon reviews using Scrapy. The problem is that when a review contains multiple line breaks, the text inside the span element is separated by <br> tags. So, to scrape the first review I use this line of code:
response.css('span.a-size-base.review-text::text').extract_first()
This does not give me the full text of the review, only the text between the opening <span> tag and the first <br> tag.
I know that if I replace "extract_first()" with "extract()", I get all the text. However, this also gives me the text of the other reviews.
So basically, the extract() method returns a list with one element per text node, that is, split at the <br> tags. I need it to be split per <span> element instead.
Is there a way to scrape all the text between the opening <span> tag and the closing </span> tag?
Example HTML:
<span data-hook="review-body" class="a-size-base review-text">
"I like this product, the reasons why are explained below"
<br>
<br>
"1. It looks nice"
<br>
"2. I love it"
</span>
What it looks like on the site:
I like this product, the reasons why are explained below

1. It looks nice
2. I love it
Output I get using extract_first():
"I like this product, the reasons why are explained below"
Output I get using extract() (note that it consists of three elements):
"I like this product, the reasons why are explained below", "1. It looks nice", "2. I love it"
Output I want to get (only one element, the review itself):
"I like this product, the reasons why are explained below 1. It looks nice 2. I love it"
Upvotes: 0
Views: 673
Reputation: 55
Use extract() and join the list.
>>> text=["I like this product, the reasons why are explained below", "1. It looks nice", "2. I love it"]
>>> " ".join(text)
'I like this product, the reasons why are explained below 1. It looks nice 2. I love it'
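Note that calling extract() on the whole response still mixes the text of every review together, which is the problem raised in the question. One way around that is to select each review span first and run the ::text query relative to that span. Below is a minimal sketch, assuming the selector from the question matches exactly one span per review; join_fragments and review_texts are hypothetical helper names for illustration, not part of Scrapy.

```python
def join_fragments(fragments):
    """Collapse the text nodes of one review into a single string."""
    # Strip each fragment and drop the empty ones (for example the
    # whitespace-only text nodes that can sit between <br> tags).
    cleaned = (fragment.strip() for fragment in fragments)
    return " ".join(fragment for fragment in cleaned if fragment)


def review_texts(response):
    """Return one joined string per review span.

    `response` is assumed to be a Scrapy response, or any selector
    object that offers the same .css() method.
    """
    reviews = response.css("span.a-size-base.review-text")
    # Querying "::text" on each span keeps the reviews separate,
    # because the query is evaluated relative to that span only.
    return [join_fragments(review.css("::text").extract()) for review in reviews]
```

In a spider callback this would be used as review_texts(response), giving a list with one string per review rather than one string per text node.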
Upvotes: 1