How to use XPATH for parsing HTML lists?

Question

I want to scrape some webpages. I am using scrapy for this. Everything works fine, but I want to 'find' a field containing numbers, which sometimes is the second, the third or the fourth 'li' in the list. Perhaps I can show you the code from the webpage:


        
    Sprache: Unbekannt
    Plattform: Bücher
    EAN / ISBN: 9783442158126

The value I want to get as result is 9783442158126.

At the moment I am locating the table with this:

//*[@id="book-info"]/ul/li[x]/span[2]

I am parsing all the 'li' (1, 2, 3, 4, 5) and then I get a CSV which I have to edit by hand, because I just need the ISBN - not the other things.

Is there a way to automate this? Perhaps I can tell XPath to search for 13-digit numbers?

Thank you very much.

Best regards, Julian

Birei · Accepted Answer

You could use an implicit and, by chaining predicates in square brackets, and check two things:

1.- Its length, with the string-length() function.
2.- That it is a number, by converting it with the number() function and comparing the result with itself. Non-numeric strings convert to NaN, and NaN is never equal to NaN, so the comparison only succeeds for numeric text. So try with:

//ul/li/span[2][number(text()) = number(text())][string-length() = 13]
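
If you prefer to do the filtering on the Python side, the same two checks can be sketched with the standard library alone. This is a different technique from the XPath above: ElementTree understands span[2] but not number() or string-length(), so the predicates are mirrored with a regular expression. The sample markup is an assumption reconstructed from the XPath in the question (the real page may differ), and find_isbns is a hypothetical helper name.

```python
import re
import xml.etree.ElementTree as ET

# Assumed markup: the tags are not shown in the question, so this structure
# is inferred from the XPath //*[@id="book-info"]/ul/li[x]/span[2].
SAMPLE = """
<div id="book-info">
  <ul>
    <li><span>Sprache:</span> <span>Unbekannt</span></li>
    <li><span>Plattform:</span> <span>Bücher</span></li>
    <li><span>EAN / ISBN:</span> <span>9783442158126</span></li>
  </ul>
</div>
"""

# Exactly 13 ASCII digits, standing in for the number() and
# string-length() = 13 predicates.
THIRTEEN_DIGITS = re.compile(r"[0-9]{13}")


def find_isbns(markup: str) -> list:
    """Return the text of every second <span> in a list item that is 13 digits."""
    root = ET.fromstring(markup)
    values = []
    # ElementTree supports the positional predicate span[2]; the numeric
    # and length checks are done in Python instead.
    for span in root.findall(".//ul/li/span[2]"):
        # Stripping whitespace is a small leniency the XPath version lacks.
        text = (span.text or "").strip()
        if THIRTEEN_DIGITS.fullmatch(text):
            values.append(text)
    return values


if __name__ == "__main__":
    print(find_isbns(SAMPLE))
```

Note that ET.fromstring only accepts well-formed markup. For real-world HTML inside Scrapy you would normally pass the XPath expression itself to the response selector rather than reparse the page.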

UPDATE: To cover the new requirement asked for in the comments, the easiest path is to combine two expressions with |, the XPath union operator, which acts like an or over the two result sets. To match a value with a trailing X, use substring-before() to extract the number and increase the expected string-length by one:

//ul/li/span[2][number(text()) = number(text())][string-length() = 13] |
  //ul/li/span[2][number(substring-before(text(), "X")) = number(substring-before(text(), "X"))][string-length() = 14]
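
To make the logic of the two branches easier to follow, here is a rough plain-Python approximation of what the union expression accepts. It is a sketch, not an exact translation: it only allows ASCII digits, whereas number() in XPath 1.0 also accepts a leading minus sign, a decimal point and surrounding whitespace. The function name matches_update_xpath and the test strings are hypothetical.

```python
def matches_update_xpath(text: str) -> bool:
    """Approximate the two branches of the union expression in plain Python."""
    # Branch 1: exactly 13 characters, all of them ASCII digits.
    if len(text) == 13 and text.isascii() and text.isdigit():
        return True
    # Branch 2: 14 characters in total, with digits before the first "X".
    # str.partition plays the role of substring-before(): 'sep' is empty
    # when the string contains no "X" at all.
    before, sep, _after = text.partition("X")
    return (
        len(text) == 14
        and sep == "X"
        and before.isascii()
        and before.isdigit()
    )


if __name__ == "__main__":
    for candidate in ["9783442158126", "9783442158126X", "Unbekannt"]:
        print(candidate, matches_update_xpath(candidate))
```

Like the XPath, the second branch only inspects the part before the first X, so a 14-character value with an X in the middle would also pass. Tighten the check if that matters for your data.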
