Reputation: 139

XPath : How to get text between 2 html tags with same level?

I'm new to XPath and I'm using Scrapy to extract text from generated HTML pages.
I get the {id} of a header tag from the user (<h1|2|.. id="title-{id}">text</h1|2|3..>). I need the text of every HTML tag between this header and the next header of the same level. So if the header is an h1, I need the text of all tags up to the next h1 header.
All header ids follow the same pattern "title-{id}", where {id} is generated.
To make it clearer, here is an example:

<html>
    <body>
        ...
        <h2 id="tittle-id1">id1</h2>
        bunch of tags containing text I want to get
        <h2 id="tittle-id2">id2</h2>
        ...
    </body>
</html>

NOTE : I don't know what header it might be. It could be any of the html header tags from <h1> to <h6>

UPDATE :
While trying a few things I realized I can't be sure that the next header is of the same level, or that it even exists, since the headers are used as titles and sub-titles. The given id may belong to the last sub-title, in which case the next header is of a higher level, or the header may be the last one on the page. So basically I only have the id of the header and I need to get all the text of its "paragraph".

Workaround :
I found a kind of workaround.
I do it in 3 steps :
First, I use //*[@id='title-{id}'] which gives me the full line with the tag, so now I know which header tag it is.
Second, I use //*[@id='title-{id}']/following-sibling::* to look for the next header of the same or higher level {myHeader}.
Last, I use //*[@id='title-{id}']/following-sibling::* and //{myHeader}//preceding-sibling::* to get what's between them, or go till the end of the page if no header is found.
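
The three steps above amount to walking the siblings that follow the header until a header of the same or higher level appears, or the page ends. Here is a minimal sketch of that walk in plain Python. It assumes well-formed markup that xml.etree.ElementTree can parse and reuses the "tittle-…" ids from the example; the helper name section_text and the sample paragraphs are hypothetical. In Scrapy the same idea would go through response.xpath instead.

```python
import re
import xml.etree.ElementTree as ET

# Matches the tag names h1..h6 and captures the level digit.
HEADER = re.compile(r"^h([1-6])$")

def section_text(root, header_id):
    """Collect text after the header with the given id, up to the next
    header of the same or a higher level (or the end of the parent)."""
    for parent in root.iter():
        children = list(parent)
        for index, child in enumerate(children):
            if child.get("id") != header_id:
                continue
            # Assumes the id always sits on an h1..h6 element.
            level = int(HEADER.match(child.tag).group(1))
            # Text that directly follows the header tag, if any.
            parts = [child.tail or ""]
            for sibling in children[index + 1:]:
                match = HEADER.match(sibling.tag)
                if match and int(match.group(1)) <= level:
                    break  # header of same or higher level: section ends
                parts.append("".join(sibling.itertext()))
                parts.append(sibling.tail or "")
            return [p.strip() for p in parts if p.strip()]
    return []

# Hypothetical sample page, shaped like the example in the question.
SAMPLE = """<body>
  <h2 id="tittle-id1">id1</h2>
  <p>first paragraph</p>
  <h3 id="tittle-id3">sub title</h3>
  <p>second paragraph</p>
  <h2 id="tittle-id2">id2</h2>
  <p>last paragraph</p>
</body>"""

root = ET.fromstring(SAMPLE)
print(section_text(root, "tittle-id3"))
```

Lower-level headers (the h3 inside an h2 section) are kept as part of the text, and a header that is the last one on the page simply runs to the end of its parent.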

Upvotes: 0

Answers (3)

supputuri

Reputation: 14135

Here is the xpath to get all the elements between h2 tags.

//h2/following-sibling::*[count(following-sibling::h2)=1]
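
To see why the count(...)=1 predicate isolates the elements between the two headers, here is a small sketch in plain Python that applies the same test to each sibling. It assumes a parent holding exactly two h2 elements; the sample markup and the helper name following_h2_count are illustrative only and not part of the expression above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with exactly two h2 siblings.
SAMPLE = """<body>
  <p>intro</p>
  <h2 id="tittle-id1">id1</h2>
  <p>wanted 1</p>
  <div>wanted 2</div>
  <h2 id="tittle-id2">id2</h2>
  <p>after</p>
</body>"""

def following_h2_count(children, index):
    """Equivalent of count(following-sibling::h2) for children[index]."""
    return sum(1 for sibling in children[index + 1:] if sibling.tag == "h2")

children = list(ET.fromstring(SAMPLE))
first_h2 = next(i for i, c in enumerate(children) if c.tag == "h2")

# Mirrors //h2/following-sibling::*[count(following-sibling::h2)=1]:
# siblings after the first h2 that still have exactly one h2 after them.
between = [
    child for i, child in enumerate(children)
    if i > first_h2 and following_h2_count(children, i) == 1
]
texts = [child.text for child in between]
print(texts)
```

With more than two h2 siblings, the predicate would match the elements just before the last h2 instead, so the header of interest has to be pinned down by its id.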

Here is the sample HTML I used to simulate the scenario (update the id to check the different options shown below).

//*[@id='tittle-id1']/following::*[count(following-sibling::*[name()=name(preceding-sibling::*[@id='tittle-id1'])])=1]

<html><head></head><body>
 
        ...
        <h2 id="tittle-id1">id1</h2>
		  <h3 id="tittle-id3"> h3 tag</h3>
		  <h4 id="tittle-id4"> h4 tag</h4>
		  <h3 id="tittle-id5"> 2nd h3  tag</h3>
        bunch of tags containing text I want to get
		   <h5 id="tittle-id6"> h5 tag </h5>
        <h2 id="tittle-id2">id2</h2>
		<h4 id="tittle-id7"> 2nd h4 tag</h4>
        ...
    
	
</body></html>

output if User input: {id1}

output if user input: {id4}

output if user input: {id3}

Note: This XPath is designed to suit the scenario in the original post.

Upvotes: 2

Deyesta

Reputation: 139

This is what worked for me.
Keep in mind that I'm using Scrapy with Python 2.7 :

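# Fragment of a Scrapy spider callback: `response`, `self.id` and `filename` come from the spider.
# Select every element that has the same tag name as the header carrying the id.
name_query = u"//*[name()=name(//*[@id='" + self.id + u"'])]"
headers = response.xpath(name_query).getall()
# Locate the serialized header that contains the id.
for position, header_html in enumerate(headers):
    if self.id in header_html:
        break
# Tag name (h1..h6) and title text, parsed from the header's HTML.
balise = "h" + header_html.split("<h")[1][0]
title = header_html.split(">")[1].split("<")[0]
# Elements whose nearest preceding header of that level has this title
# and which are still followed by a header of that level.
query = u"//*[preceding-sibling::" + balise + u"[1] ='" + title + u"' and following-sibling::" + balise + u"]"
self.log('query = ' + query)
results = response.xpath(query)
# Drop the last match (the next header itself, when it is followed by another header of that level).
results.pop(len(results) - 1)
with open(filename, 'wb') as f:
    for text in results.css("::text").getall():
        f.write((text + u"\n").encode('utf-8'))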

This should work in general. I tested it against multiple headers with different levels and it works fine for me.

Upvotes: 0

Alejandro

Reputation: 1882

Because predicates in XPath filter the context node list, you can't perform a join selection unless you can reintroduce the target values from a context relative to your source values. For example, this selects all the elements with the same name as the one that has a specific id attribute:

//*[name()=name(//*[@id=$generated-id-string])]

Now, for the "between marks" problem, use the Kaysian method for intersection, as usual:

//*[name()=name(//*[@id=$generated-id-string])]/preceding-sibling::node()[
   count(.|//*[@id=$generated-id-string]/following-sibling::node())
      =
   count(//*[@id=$generated-id-string]/following-sibling::node())
]

Test in http://www.xpathtester.com/xpath/0dcfdf59dccb8faf3705c22167ae45f1
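
The Kaysian idiom relies on the fact that a node "." belongs to a node-set B exactly when count(.|B) = count(B), since adding an existing member to a set does not grow it. Here is a rough Python analogue, assuming sample markup shaped like the question's. The variable and helper names are illustrative, xml.etree.ElementTree stands in for a real XPath engine, and only element siblings are considered rather than every node().

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: two headers of the same level with content between.
SAMPLE = """<body>
  <p>intro</p>
  <h2 id="tittle-id1">id1</h2>
  <p>wanted 1</p>
  <div>wanted 2</div>
  <h2 id="tittle-id2">id2</h2>
  <p>after</p>
</body>"""

children = list(ET.fromstring(SAMPLE))
start = next(i for i, c in enumerate(children) if c.get("id") == "tittle-id1")
end = next(i for i, c in enumerate(children) if c.get("id") == "tittle-id2")

after_start = children[start + 1:]  # like $start/following-sibling::*
before_end = children[:end]         # like $end/preceding-sibling::*

def in_node_set(node, node_set):
    """count(.|B) = count(B): the union with B is no bigger than B."""
    members = {id(member) for member in node_set}
    union = members | {id(node)}
    return len(union) == len(members)

# Keep the nodes before the end mark that also follow the start mark.
between = [node for node in before_end if in_node_set(node, after_start)]
texts = [node.text for node in between]
print(texts)
```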

Upvotes: 1
