scrapitnow

Reputation: 13

<br> tags screw up my data when using Scrapy and Python

I am trying to scrape the text of Amazon reviews using Scrapy. The problem is that when a review contains multiple line breaks, the text inside the span element is separated by <br> tags. To scrape the first review, I use this line of code:

response.css('span.a-size-base.review-text::text').extract_first()

This does not return the full text of the review, only the text between the opening <span> tag and the first <br> tag.

I know that if I replace extract_first() with extract(), I get all of the text. However, this also returns the text of the other reviews.

So basically, the extract() method returns a list whose elements are split at the <br> tags. I need them to be split per <span> element instead.

Is there a way to scrape all of the text between the opening <span> tag and the closing </span> tag?

Example HTML:

<span data-hook="review-body" class="a-size-base review-text">
    "I like this product, the reasons why are explained below"
    <br>
    <br>
    "1. It looks nice"
    <br>
    "2. I love it"
</span>

What it looks like on the site:

I like this product, the reasons why are explained below

  1. It looks nice
  2. I love it

Output I will get using extract_first():

"I like this product, the reasons why are explained below"

Output I will get using extract() (note that it consists of three elements):

"I like this product, the reasons why are explained below", "1. It looks nice", "2. I love it"

Output I want to get (only one element, the review itself):

"I like this product, the reasons why are explained below 1. It looks nice 2. I love it"

Upvotes: 0

Views: 673

Answers (1)

Pandian Muninathan

Reputation: 55

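Use extract() and join the resulting list into a single string. To keep the reviews separate, select each review's span first and call extract() on that span's text nodes only, then join per review.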

>>> text = ["I like this product, the reasons why are explained below", "1. It looks nice", "2. I love it"]
>>> " ".join(text)
'I like this product, the reasons why are explained below 1. It looks nice 2. I love it'
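To avoid mixing text from different reviews, the join can be applied per span rather than to the whole page. The following is a minimal sketch, not a definitive implementation: the helper names join_fragments and extract_reviews are hypothetical, the selector is the one from the question, and it assumes response is a Scrapy response (or any selector object exposing .css() and .extract()). It also assumes the real text nodes carry surrounding whitespace, so each piece is stripped before joining.

```python
def join_fragments(fragments):
    """Join the text nodes of one review into a single string.

    Whitespace-only pieces (e.g. the newlines between <br> tags) are dropped
    and the remaining pieces are stripped before joining with a space.
    """
    cleaned = (part.strip() for part in fragments)
    return " ".join(part for part in cleaned if part)


def extract_reviews(response):
    """Return one string per review span (hypothetical helper).

    Iterating over the span selectors first keeps each review's text
    separate; '::text' is then evaluated relative to that single span.
    """
    reviews = []
    for span in response.css("span.a-size-base.review-text"):
        fragments = span.css("::text").extract()
        reviews.append(join_fragments(fragments))
    return reviews
```

Inside a spider callback this could be used as reviews = extract_reviews(response), with reviews[0] being the complete first review.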

Upvotes: 1
