Reputation: 3815
I am scraping an XML document like this:
>>> response.xpath("//ul[@class='meta-info d-flex flex-wrap align-items-center list-unstyled justify-content-around']/li[position()=2]/text()").extract()
and it gives me the following output:
['\n ', '\n\t\t\t ', '\n\t\t\t\t24 Feb, 2019 ', '\n ', '\n\t\t\t ', '\n\t\t\t\t24 Feb, 2019 ', '\n ', '\n\t\t\t ', '\n\t\t\t\t24 Feb, 2019 ', '\n ', '\n\t\t\t ', '\n\t\t\t\t24 Feb, 2019 ', '\n ', '\n\t\t\t ', '\n\t\t\t\t24 Feb, 2019 ', '\n ', '\n\t\t\t ', '\n\t\t\t\t24 Feb, 2019 ', '\n ', '\n\t\t\t ', '\n\t\t\t\t24 Feb, 2019 ', '\n ', '\n\t\t\t ', '\n\t\t\t\t24 Feb, 2019 ', '\n ', '\n\t\t\t ', '\n\t\t\t\t24 Feb, 2019 ', '\n ', '\n\t\t\t ', '\n\t\t\t\t24 Feb, 2019 ', '\n ', '\n\t\t\t ', '\n\t\t\t\t24 Feb, 2019 ', '\n ', '\n\t\t\t ', '\n\t\t\t\t24 Feb, 2019 ', '\n ', '\n\t\t\t ', '\n\t\t\t\t24 Feb, 2019 ', '\n ', '\n\t\t\t ', '\n\t\t\t\t23 Feb, 2019 ', '\n ', '\n\t\t\t ', '\n\t\t\t\t24 Feb, 2019 ', '\n ', '\n\t\t\t ', '\n\t\t\t\t24 Feb, 2019 ', '\n ', '\n\t\t\t ', '\n\t\t\t\t24 Feb, 2019 ', '\n ', '\n\t\t\t ', '\n\t\t\t\t24 Feb, 2019 ', '\n ', '\n\t\t\t ', '\n\t\t\t\t24 Feb, 2019 ', '\n ', '\n\t\t\t ', '\n\t\t\t\t24 Feb, 2019 ', '\n ', '\n\t\t\t ', '\n\t\t\t\t24 Feb, 2019 ']
But I do not want any entries that consist only of newlines, tabs, or spaces, so I am trying to use the normalize-space() function, as follows:
>>> response.xpath("normalize-space(//ul[@class='meta-info d-flex flex-wrap align-items-center list-unstyled justify-content-around']/li[position()=2]/text())").extract()
But I am getting a single empty string as output:
['']
What is happening here?
Upvotes: 0
Views: 425
Reputation: 338278
normalize-space() works on a single string. You are giving it a whole list of nodes.
So it takes the first one, converts that to a string, and returns the result. Your first node has a value of '\n ', which normalizes to the empty string.
Write a for loop over //ul[@class='meta-info d-flex flex-wrap align-items-center list-unstyled justify-content-around']/li[position()=2] and call normalize-space() on the individual nodes.
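To illustrate the difference between normalizing only the first node and normalizing every node, here is a minimal sketch in plain Python. It does not call Scrapy: `raw` is a shortened, hand-copied sample of the list from the question, and `normalize_space` is a hypothetical helper that only approximates what XPath's normalize-space() does to one string. In Scrapy itself, the per-node version would presumably be something like `li.xpath("normalize-space(.)").extract_first()` inside a loop over the `li` selectors, but that part is untested here.

```python
import re

# Shortened sample of the strings returned by .extract() in the question.
raw = ['\n ', '\n\t\t\t ', '\n\t\t\t\t24 Feb, 2019 ',
       '\n ', '\n\t\t\t ', '\n\t\t\t\t23 Feb, 2019 ']


def normalize_space(text):
    """Approximate XPath normalize-space() for a single string (hypothetical helper)."""
    # Collapse runs of XML whitespace (space, tab, CR, LF) into one space...
    collapsed = re.sub(r'[ \t\r\n]+', ' ', text)
    # ...then drop the leading and trailing space, if any.
    return collapsed.strip(' ')


# What the call in the question effectively does: only the first node is used.
first_only = normalize_space(raw[0])

# Normalizing each node separately and dropping the empty results.
dates = [s for s in (normalize_space(t) for t in raw) if s]

print(repr(first_only))
print(dates)
```

The first value comes out empty because the first text node holds nothing but whitespace, while the per-node loop keeps only the two dates.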
Upvotes: 1
Reputation: 1351
I used a regex to solve a similar problem; I have included it below if you want to test it. I found that it works well. This question should answer what is happening with normalize-space. It's expected to return an empty string on a text node that contains only whitespace.
import re
item_text = response.xpath("//ul[@class='meta-info d-flex flex-wrap align-items-center list-unstyled justify-content-around']/li[position()=2]/text()").extract()
re.sub(r'\s{2,}', '\n', "".join(item_text).strip())
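For anyone who wants to try the regex without a live response, here is a small sketch that runs it on a hand-copied, shortened sample of the output from the question. The sample list and the final split into a list are my own additions for illustration, not part of the original snippet.

```python
import re

# Shortened sample standing in for the result of .extract() (an assumption).
item_text = ['\n ', '\n\t\t\t ', '\n\t\t\t\t24 Feb, 2019 ',
             '\n ', '\n\t\t\t ', '\n\t\t\t\t23 Feb, 2019 ']

# Join all text nodes, trim the ends, and turn every run of two or more
# whitespace characters into a single newline.
cleaned = re.sub(r'\s{2,}', '\n', "".join(item_text).strip())

# One entry per line; the single spaces inside each date are untouched.
dates = cleaned.split('\n')
print(dates)
```

Note that this relies on the wanted values never containing two consecutive whitespace characters themselves.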
Upvotes: 1