Julian Baehr

Reputation: 27

How to use XPATH for parsing HTML lists?

I want to scrape some webpages, and I am using Scrapy for this. Everything works fine, but I want to find a field containing numbers, which is sometimes the second, third, or fourth 'li' in the list. Here is the relevant markup from the webpage:

<ul class="basic-product-information key-value-list">
        <li>
            <span class="key">Sprache:</span>
            <strong class="value">Unbekannt</strong>
        </li>
        <li>
            <span class="key">Plattform:</span>
            <span class="value">Bücher</span>
        </li>
        <li>
            <span class="key">EAN / ISBN:</span>
            <span class="value">9783442158126</span>
        </li>
</ul>

The value I want to get as result is 9783442158126.

At the moment I am locating the values with this:

//*[@id="book-info"]/ul/li[x]/span[2]

I am parsing all the 'li' (1, 2, 3, 4, 5) and then I get a CSV which I have to edit by hand, because I just need the ISBN - not the other things.

Is there a way to automate this? Perhaps I can tell XPath to search for 13-digit numbers?

Thank you very much.

Best regards, Julian

Upvotes: 0

Views: 75

Answers (1)

Birei

Reputation: 36282

You could use an implicit and, by chaining predicates in consecutive square brackets, and check:

1.- Its length, with the string-length() function.
2.- That it is a number, by converting it with the number() function and comparing the result with itself. Non-numeric strings convert to NaN, and NaN is never equal to NaN, so the comparison is true only for numeric text. So try with:

//ul/li/span[2][number(text()) = number(text())][string-length() = 13]
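To see why the two predicates single out the ISBN, the same checks can be mirrored in plain Python. This is only a sketch under some assumptions: it uses the standard library's xml.etree.ElementTree, whose XPath subset supports neither number() nor string-length(), so both tests are re-implemented in the hypothetical helpers to_number and is_numeric_13. The markup is the snippet from the question, and float() is only a rough stand-in for XPath's number().

```python
import math
import xml.etree.ElementTree as ET

# Markup copied from the question (it happens to be well-formed XML).
HTML = """<ul class="basic-product-information key-value-list">
  <li><span class="key">Sprache:</span><strong class="value">Unbekannt</strong></li>
  <li><span class="key">Plattform:</span><span class="value">Bücher</span></li>
  <li><span class="key">EAN / ISBN:</span><span class="value">9783442158126</span></li>
</ul>"""


def to_number(text):
    # Hypothetical helper: rough stand-in for XPath number().
    # Returns NaN when the text is not numeric.
    try:
        return float(text)
    except ValueError:
        return math.nan


def is_numeric_13(text):
    # Hypothetical helper mirroring the two predicates:
    # [number(text()) = number(text())] and [string-length() = 13].
    n = to_number(text)
    return n == n and len(text) == 13  # NaN is never equal to itself


root = ET.fromstring(HTML)
# span[2] selects the second span child of each li, as in the answer.
values = [el.text or "" for el in root.findall("./li/span[2]")]
isbns = [v for v in values if is_numeric_13(v)]
print(isbns)
```

In Scrapy itself the XPath expression above can be used directly, so none of this Python filtering is needed there; the sketch only illustrates the logic.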

UPDATE: To achieve the new requirement asked for in the comments, the easiest path is to combine two expressions with |, the union operator in XPath, which here acts like an or between the two conditions. To match a trailing X, use substring-before() to get the number and increment the expected string-length by one:

//ul/li/span[2][number(text()) = number(text())][string-length() = 13] |
  //ul/li/span[2][number(substring-before(text(), "X")) = number(substring-before(text(), "X"))][string-length() = 14]
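The logic of the two branches can again be mirrored in plain Python as a rough sketch. The helpers to_number, substring_before, and matches are hypothetical names, float() only approximates XPath's number(), and the sample value ending in X is made up for illustration.

```python
import math


def to_number(text):
    # Hypothetical helper: rough stand-in for XPath number().
    try:
        return float(text)
    except ValueError:
        return math.nan


def substring_before(text, marker):
    # Mirrors XPath substring-before(): the empty string when the
    # marker does not occur in the text.
    before, sep, _ = text.partition(marker)
    return before if sep else ""


def matches(text):
    # First branch: numeric text of length 13.
    n = to_number(text)
    plain = n == n and len(text) == 13
    # Second branch: numeric part before "X", total length 14.
    m = to_number(substring_before(text, "X"))
    with_x = m == m and len(text) == 14
    return plain or with_x


# Made-up sample values, for illustration only.
for sample in ["9783442158126", "1234567890123X", "Unbekannt", "12345678901234"]:
    print(sample, matches(sample))
```

A value without an X never satisfies the second branch, because substring-before() then yields an empty string, which converts to NaN.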

Upvotes: 1
