Reputation: 62
I am using Scrapy to collect data from a cinema webpage.
When I use the XPath selectors with the extract() method, like this:
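def parse_with_extract(self, response):
    div = response.xpath("//div[@class='col-sm-7 col-md-9']/p[@class='movie__option']")
    storage = []
    for i in div:
        # extract() returns every direct text node of this <p> as a list
        data = i.xpath("text()").extract()
        storage.append(data)
    return storage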
It returns:
When I use the selectors with the extract_first() method, like this:
def parse_with_extract_first(self, response):
    div = response.xpath("//div[@class='col-sm-7 col-md-9']/p[@class='movie__option']")
    storage = []
    for i in div:
        data = i.xpath("text()").extract_first()
        storage.append(data)
    return storage
It returns:
Why is the extract() method returning all characters, including the "\xa0", and the extract_first() method returning an empty string instead?
Upvotes: 0
Views: 4481
Reputation: 3146
If you look closer at the response, you'll see that the @class=movie__option element looks like this:
'<p class="movie__option" style="color: #000;">\n <strong>Thursday 3rd of May 2018:</strong>\n 11:20am\xa0 \xa0 \n </p>'
If you extract text() of this element, you get two strings: one before the strong tag and one after it (text() selects only the element's direct child text nodes, not text nested inside strong):
['\n ',
'\n 11:20am\xa0 \xa0 \n ']
extract_first() simply takes the first of these two strings:
'\n '
Upvotes: 4
Reputation: 111
Your output looks something like the following and contains two strings:
['\n ',
'\n 11:20am\xa0 \xa0 \n ']
My suggestion for anyone who gets newlines and surrounding whitespace back like this: use Python's built-in strip() method. It works on a single string, so call it on the value returned by get(). Note that get() returns only the first match, so the XPath must select the text node you actually want:
data = response.xpath("//path/to/your/data").get().strip()
Provided the XPath selects the text node that holds the time, the output will look something like this:
'11:20am'
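When the selector returns several text nodes, as in the two-string list above, strip() can be applied to each of them and the whitespace-only ones dropped. This is a minimal sketch in plain Python; raw_strings stands in for whatever getall() or extract() returned, and both names are illustrative.

```python
# Stand-in for the list returned by getall() / extract() in the question.
raw_strings = ['\n ', '\n 11:20am\xa0 \xa0 \n ']

# strip() with no arguments removes leading and trailing whitespace,
# including newlines and non-breaking spaces (\xa0).
# Strings that are empty after stripping are discarded.
cleaned = [s.strip() for s in raw_strings if s.strip()]

print(cleaned)
```

This keeps only the strings with visible content, so it does not matter which position the wanted text node has in the list.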
Also, have a look at the difference between extract() and extract_first().
extract()
This method returns a list of strings, one per match. It is the older Scrapy name; the method used nowadays is getall(), which behaves the same as extract() when called on the result of xpath() or css().
extract() -- updated to --> getall()
Now let's take a look at the extract_first() method:
extract_first()
This method returns a single str (the first match) instead of a list, or None if nothing matches. It is also an older Scrapy name; the method used nowadays is get().
extract_first() -- updated to --> get()
Upvotes: 2