extract_first() and extract() methods on Scrapy selectors are not returning the same value

Question

I am using Scrapy to collect data from a cinema webpage.

When I call the extract() method on an XPath selector, like this:

def parse_with_extract(self, response):
    div = response.xpath("//div[@class='col-sm-7 col-md-9']/p[@class='movie__option']")
    data = div.xpath("text()").extract()
    return data

It returns:

If I use the selector with the extract_first() method as such:

def parse_with_extract_first(self, response):
    div = response.xpath("//div[@class='col-sm-7 col-md-9']/p[@class='movie__option']")
    storage = []
    for i in div:
        data = i.xpath("text()").extract_first()
        storage.append(data)
    return storage

It returns:

Why is the extract() method returning all characters, including the "\xa0", and the extract_first() method returning an empty string instead?

stasdeep · Accepted Answer

If you look more closely at the response, you'll see that the element with class movie__option looks like this (the date sits inside a strong tag):

'
                                    Thursday 3rd of May 2018:
                                    11:20am\xa0 \xa0  
                                '

If you extract text() from this element, you get two strings: one for the text before the strong tag and one for the text after it, because text() selects only the element's direct text nodes, not text inside child elements:

['
                                    ',
 '
                                    11:20am\xa0 \xa0  
                                ']

extract_first() simply returns the first of these two strings, which here is only whitespace:

'
                                    '
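One possible workaround, sketched here in plain Python, is to call extract() and pick the first string that is not just whitespace. The helper name first_non_blank is hypothetical (it is not part of Scrapy), and the sample list only mimics the shape of the output shown above:

```python
def first_non_blank(strings):
    """Return the first string that is not whitespace-only, cleaned up.

    Returns None when every string is blank. Hypothetical helper,
    not part of Scrapy.
    """
    for s in strings:
        # Turn non-breaking spaces into regular ones, then trim the ends.
        cleaned = s.replace("\xa0", " ").strip()
        if cleaned:
            return cleaned
    return None


# Shape of the list that extract() returns for the element above:
# whitespace before the strong tag, then the showtime after it.
text_nodes = [
    "\n                                    ",
    "\n                                    11:20am\xa0 \xa0  \n                                ",
]

# extract_first() would hand back text_nodes[0], which is blank once stripped.
print(repr(text_nodes[0].strip()))
print(first_non_blank(text_nodes))
```

In the spider from the question, this would be called as first_non_blank(i.xpath("text()").extract()) inside the loop, in place of the extract_first() call.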
