6ones

Reputation: 62

extract_first() and extract() methods on Scrapy selectors are not returning the same value

I am using Scrapy to collect data from a cinema webpage.

Working with the XPath selectors, if I use the selectors with the extract() method, as such:

def parse_with_extract(self, response):
    div = response.xpath("//div[@class='col-sm-7 col-md-9']/p[@class='movie__option']")
    data = div.xpath("text()").extract()
    return data

It returns:

[Image: result of extract() on selectors]

If I use the selector with the extract_first() method as such:

def parse_with_extract_first(self, response):
    div = response.xpath("//div[@class='col-sm-7 col-md-9']/p[@class='movie__option']")
    storage = []
    for i in div:
        data = i.xpath("text()").extract_first()
        storage.append(data)
    return storage

It returns:

[Image: result of extract_first() on selectors]

Why is the extract() method returning all characters, including the "\xa0", and the extract_first() method returning an empty string instead?

Upvotes: 0

Views: 4481

Answers (2)

stasdeep

Reputation: 3146

If you look closer at the response, you'll see that the element with class movie__option looks like this:

'<p class="movie__option" style="color: #000;">\n                                    <strong>Thursday 3rd of May 2018:</strong>\n                                    11:20am\xa0 \xa0  \n                                </p>'

If you extract text() of this element, you get two strings: one before the strong tag and one after it (text() selects only the element's direct text nodes, not text inside child elements):

['\n                                    ',
 '\n                                    11:20am\xa0 \xa0  \n                                ']

extract_first() simply takes the first of these two strings, which contains only whitespace:

'\n                                    '
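The same relationship can be reproduced with a plain Python list, without Scrapy. This is only a minimal sketch: texts is copied from the output above and stands in for what extract() returned, and first mimics what extract_first() would give for that list (both variable names are made up for illustration).

```python
# Stand-in for the list that .xpath("text()").extract() returned above.
texts = [
    "\n                                    ",
    "\n                                    11:20am\xa0 \xa0  \n                                ",
]

# extract_first() corresponds to the first item (or None if the list is empty).
first = texts[0] if texts else None

# The first text node is whitespace only, so it looks empty when printed.
print(repr(first))
print(first.strip() == "")  # True

# The showtime lives in the second text node, after the <strong> tag.
print(texts[1].strip())  # 11:20am
```

So the two methods do not disagree: extract_first() returns the first item of the same list that extract() returns, and that item happens to be blank.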

Upvotes: 4

Jawad Mehmood

Reputation: 111

Your output looks something like the following:

['\n                                    ',
 '\n                                    11:20am\xa0 \xa0  \n                                ']

It contains two strings.

My suggestion for anyone getting data padded with newlines and spaces like this: use Python's built-in strip() method. It works on a single string, so you can apply it to the value returned by get():

data = response.xpath("//path/to/your/data").get().strip()

Since get() returns only the first match, the XPath has to select the text node that holds the value. The output then looks something like this:

'11:20am'
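For the element in the question, the first text node is whitespace only, so stripping just the first match would leave an empty string. One option, sketched below under that assumption, is to strip every string from getall() (or extract()) and discard the blanks. The helper name clean_texts and the variable raw are hypothetical, not part of Scrapy; raw is copied from the output above.

```python
def clean_texts(texts):
    """Strip each string and drop the ones that end up empty."""
    stripped = (t.strip() for t in texts if t is not None)
    return [t for t in stripped if t]


# Stand-in for the result of .xpath("text()").getall() on the <p> element.
raw = [
    "\n                                    ",
    "\n                                    11:20am\xa0 \xa0  \n                                ",
]

cleaned = clean_texts(raw)
print(cleaned)  # ['11:20am']

# First non-blank value, or None when nothing is left after cleaning.
first_value = cleaned[0] if cleaned else None
print(first_value)  # 11:20am
```

Note that strip() with no arguments also removes the non-breaking spaces (\xa0), because Python treats them as whitespace.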

Also, here is the difference between extract() and extract_first().

  1. extract()

This method returns a list of all matches. It is the older name in Scrapy; the name used nowadays is getall(), which behaves the same as extract().

extract() -- updated to --> getall()

Now let's take a look at the extract_first() method.

  2. extract_first()

This method returns a single str (the first match) instead of a list, or None if nothing matches. It is also an older name in Scrapy; the name used nowadays is get().

extract_first() -- updated to --> get()

Upvotes: 2
