kosta12
kosta12

Reputation: 1

XPath Query: Extracting text between LI & A tags

In google sheets, for my own amusement, I'm trying to display "Galleon in Valley of the Four Winds." as one string from the code below.

I want to do this for about 600 pages, all of which have identical structure in their HTML (without ID's). I'm only ever going to be interested in the first list between the UL tags.

<h3>Source:</h3>
<ul>
  <li>
    <a href='http://www.wowhead.com/npc=62346' target='_blank'>Galleon</a> in Valley of the Four Winds.
  </li>
  <li>
    <a href='/bmah.php'>The Black Market Auction House</a> (rarely)
  </li>  
</ul>

There are many many lists in the source code & not always in the same order, which makes something like IMPORTHTML(B2,"list",3) hard to use.

I can get "Galleon" by itself using this

=IMPORTXML(URL, "//a[@href[starts-with(., 'http://www.wowhead.com/npc')]]")

I tried adding a "//li | " but it brought back all of the lists & not the text that I hoped for, which made sense but I'm at a loss on how to proceed further with this.

=IMPORTXML(URL, "//li | //a[@href[starts-with(.,'http://www.wowhead.com/npc')]]")

I've tried reading through guides & guidelines, but at this point I'm just floundering and a bit lost.

Hope that all made sense, many thanks in advance for the replies.

Upvotes: 0

Views: 120

Answers (2)

kosta12
kosta12

Reputation: 1

Thank you!

I've rebuilt it for Google Sheets & it posts each part in a different cell, however a quick concatenation has built a complete string.

 =IMPORTXML(A5,"//li[a[@href[starts-with(.,""wowhead.com/npc"")]]]")

Upvotes: 0

LMC
LMC

Reputation: 12822

This one is working on your sample

xmllint --html --xpath 'string(//li[a[@href[starts-with(., "http://www.wowhead.com/npc")]]])' test.html

Galleon in Valley of the Four Winds.

Upvotes: 1

Related Questions