Scrapy 'normalize-space()' is truncating the whole string

Question

I am scraping an XML document like this:

>>> response.xpath("//ul[@class='meta-info d-flex flex-wrap align-items-center list-unstyled justify-content-around']/li[position()=2]/text()").extract()

and it gives me the following output:

['
            ', '
			                ', '
				24 Feb, 2019        ', '
            ', '
			                ', '
				24 Feb, 2019        ', '
            ', '
			                ', '
				24 Feb, 2019        ', '
            ', '
			                ', '
				24 Feb, 2019        ', '
            ', '
			                ', '
				24 Feb, 2019        ', '
            ', '
			                ', '
				24 Feb, 2019        ', '
            ', '
			                ', '
				24 Feb, 2019        ', '
            ', '
			                ', '
				24 Feb, 2019        ', '
            ', '
			                ', '
				24 Feb, 2019        ', '
            ', '
			                ', '
				24 Feb, 2019        ', '
            ', '
			                ', '
				24 Feb, 2019        ', '
            ', '
			                ', '
				24 Feb, 2019        ', '
            ', '
			                ', '
				24 Feb, 2019        ', '
            ', '
			                ', '
				23 Feb, 2019        ', '
            ', '
			                ', '
				24 Feb, 2019        ', '
            ', '
			                ', '
				24 Feb, 2019        ', '
            ', '
			                ', '
				24 Feb, 2019        ', '
            ', '
			                ', '
				24 Feb, 2019        ', '
            ', '
			                ', '
				24 Feb, 2019        ', '
            ', '
			                ', '
				24 Feb, 2019        ', '
            ', '
			                ', '
				24 Feb, 2019        ']

But I do not want any entries that consist only of newlines, tabs, or spaces, so I am trying to use the normalize-space() function as follows:

>>> response.xpath("normalize-space(//ul[@class='meta-info d-flex flex-wrap align-items-center list-unstyled justify-content-around']/li[position()=2]/text())").extract()

But I am getting a single empty string:

['']

What is happening here?

Matts · Accepted Answer

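I used a regex to solve a similar problem; it is included below if you want to test it, and I have found that it works well. As for what is happening with normalize-space(): it takes a single string, so when it is handed a node-set it converts only the first node in document order. Here that first text node is pure whitespace, so an empty string is the expected result.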

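import re

# Join every text node, trim both ends, then collapse each run of two or more whitespace characters into one newline
item_text = response.xpath("//ul[@class='meta-info d-flex flex-wrap align-items-center list-unstyled justify-content-around']/li[position()=2]/text()").extract()
cleaned = re.sub(r'\s{2,}', '\n', "".join(item_text).strip())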
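If a clean list with one date per entry is preferred over a single newline-separated string, the extracted strings can also be normalised one at a time in plain Python. The following is only a minimal sketch: the helper names normalize_space and clean_text_nodes are hypothetical, and the sample list is a shortened stand-in for what extract() returned in the question.

```python
def normalize_space(text):
    """Roughly mimic XPath normalize-space(): trim the ends and collapse
    internal whitespace runs into single spaces. Note that str.split()
    treats any Unicode whitespace as a separator, which is slightly
    broader than the XPath definition."""
    return " ".join(text.split())


def clean_text_nodes(nodes):
    """Normalise every extracted string and drop the ones left empty."""
    return [cleaned for cleaned in map(normalize_space, nodes) if cleaned]


# Shortened stand-in for the list returned by .extract() in the question
sample = [
    "\n            ",
    "\n\t\t\t                ",
    "\n\t\t\t\t24 Feb, 2019        ",
    "\n            ",
    "\n\t\t\t                ",
    "\n\t\t\t\t23 Feb, 2019        ",
]

print(clean_text_nodes(sample))  # ['24 Feb, 2019', '23 Feb, 2019']
```

The filtering could probably also be pushed into the query itself by appending the predicate [normalize-space()] to the text() step, which keeps only text nodes that contain something other than whitespace; the surviving strings would still need trimming afterwards.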
