Starid

Reputation: 33

XPath in scrapy returns elements which don't exist

I am creating a new Scrapy spider and everything is going well, except for one website where response.xpath returns list items that don't exist in the HTML code:

{"pdf_name": ["\n\t\t\t\t\t\t\t\t\t", "ZZZZZZ", "\n\t\t\t\t\t\t\t\t\t", "PDF", "\n\t\t\t\t\t\t\t\t"]},
{"pdf_name": ["\n\t\t\t\t\t\t\t\t\t\t", "YYYYYY", "\n\t\t\t\t\t\t\t\t\t\t", "XXXXXX"]}

As you can see below, these "empty" items (\t and \n) are not enclosed in any of the HTML tags. If I understand correctly, XPath is including the whitespace before the tags:

<div class="inner d-i-b va-t" role="group">
                        <a class="link-to" href="A.pdf" target="_blank">
                                    <i class="offscreen">ZZZZZZ</i>
                                    <span>PDF</span>
                                </a>

                                <div class="text-box">
                                    <a href="A.pdf">
                                        <i class="offscreen">YYYYYY</i>
                                        <p>XXXXXX</p></a>
                                </div>
                            </div>

I know that I can strip() the strings to remove the whitespace, but that would only mitigate the issue rather than fix the underlying problem, which is that whitespace is included in the results.

Why is this happening? How can I limit XPath results to tags only (I previously thought that was the default)?

Spider code - parse function (pdf_name is causing problems)

def parse(self, response):

    # Select all links to pdfs
    for pdf in response.xpath('//a[contains(@href, ".pdf")]'):
        item = PdfItem()

        # Create a list of text fields for links to PDFs and their descendants
        item['pdf_name'] = pdf.xpath('descendant::text()').extract()

        yield item

Upvotes: 0

Views: 643

Answers (1)

Tomalak

Reputation: 338228

Whitespace is part of the document. Just because you think it is unimportant does not make it go away.

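A text node is a text node. Whether it consists of ' ' (the space character) or any other character makes no difference at all. The line breaks and tabs between your tags are text nodes too, so descendant::text() selects them along with the visible text.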
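To see that the whitespace really is in the document and is not something Scrapy adds, here is a minimal sketch that uses only the standard library. It assumes the markup from the question (indentation shortened) and uses ElementTree's itertext(), which walks the text nodes below an element much like descendant::text() does. The names SNIPPET and text_nodes are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Markup from the question, with the indentation shortened for readability.
SNIPPET = """<div class="inner d-i-b va-t" role="group">
  <a class="link-to" href="A.pdf" target="_blank">
    <i class="offscreen">ZZZZZZ</i>
    <span>PDF</span>
  </a>
  <div class="text-box">
    <a href="A.pdf">
      <i class="offscreen">YYYYYY</i>
      <p>XXXXXX</p></a>
  </div>
</div>"""

root = ET.fromstring(SNIPPET)

# itertext() yields every piece of text below an element in document order,
# including the indentation that sits between the child tags.
text_nodes = [list(a.itertext()) for a in root.iter("a")]

for nodes in text_nodes:
    print(nodes)
```

The first link yields five strings and the second yields four, the same pattern as in the spider output: the second link has no trailing whitespace node because </p> is followed directly by </a>.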

You can normalize the whitespace with the normalize-space() XPath function:

def parse(self, response):
    for pdf_link in response.xpath('//a[contains(@href, ".pdf")]'):
        item = PdfItem()
        item['pdf_name'] = pdf_link.xpath('normalize-space(.)').extract()
        yield item

First, normalize-space() converts its argument to a string, which for an element means concatenating all of its descendant text nodes. Then it trims leading and trailing whitespace and collapses any run of consecutive whitespace (including line breaks) into a single space. For example, '\n bla \n\n bla ' becomes 'bla bla'. Note that the result is one string per link, so extract() returns a list with a single item rather than one item per text node.

Upvotes: 2
