Reputation: 139
I'm new to XPath and I'm using Scrapy to extract text from generated HTML pages.
I get the {id} of a header tag from the user (<h1|2|.. id="title-{id}">text</h1|2|..>). I need to get the text of every HTML tag between this header and the next header of the same level. So if the header is an h1, I need all the text of all tags up to the next h1 header.
All header ids follow the same pattern, "title-{id}", where {id} is generated.
To make it clearer, here is an example:
<html>
<body>
...
<h2 id="tittle-id1">id1</h2>
bunch of tags containing text I want to get
<h2 id="tittle-id2">id2</h2>
...
</body>
</html>
NOTE: I don't know which header it might be. It could be any of the HTML header tags from <h1> to <h6>.
UPDATE:
While trying a few things I realized that I can't be sure the next header is of the same level, or that it exists at all, since the headers are used as titles and sub-titles. The given id may belong to the last sub-title, in which case the next header is of a higher level, or the header may be the last one on the page. So basically I only have the id of the header and I need to get all the text of its "paragraph".
Workaround:
I found a kind of workaround, which I do in 3 steps:
First, I use //*[@id='title-{id}']
which returns the full line with the tag, so now I know which header tag it is.
Second, I use //*[@id='title-{id}']/following-sibling::*
to look for the next header of the same or a higher level, {myHeader}.
Last, I use //*[@id='title-{id}']/following-sibling::*
and //{myHeader}//preceding-sibling::*
to get what's in between, or to go to the end of the page if no header is found.
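The three steps above can be sketched in plain Python. This is only an illustration under assumptions: it uses the standard library's xml.etree.ElementTree instead of Scrapy selectors, it assumes the page is well-formed markup, and the names section_text and SAMPLE are hypothetical. It walks the siblings after the header and stops at the next header of the same or a higher level, or at the end of the parent if there is none.

```python
import re
import xml.etree.ElementTree as ET

HEADER = re.compile(r"^h([1-6])$")  # matches the tag names h1..h6


def section_text(markup, header_id):
    """Return the text chunks between the header with header_id and the next
    header of the same or a higher level (or the end of the parent)."""
    root = ET.fromstring(markup)
    # ElementTree keeps no parent pointers, so look at each element's children.
    for parent in root.iter():
        children = list(parent)
        for index, child in enumerate(children):
            if child.get("id") == header_id and HEADER.match(child.tag):
                level = int(HEADER.match(child.tag).group(1))
                chunks = [child.tail or ""]  # loose text right after the header
                for sibling in children[index + 1:]:
                    match = HEADER.match(sibling.tag)
                    if match and int(match.group(1)) <= level:
                        break  # a same or higher level header ends the section
                    chunks.append("".join(sibling.itertext()))
                    chunks.append(sibling.tail or "")
                return [c.strip() for c in chunks if c.strip()]
    return []  # no header with that id


# Hypothetical sample page, reusing the id pattern from the example above.
SAMPLE = """<html><body>
<h2 id="tittle-id1">id1</h2>
<p>first paragraph</p>
<h3 id="tittle-id3">sub</h3>
<p>second paragraph</p>
<h2 id="tittle-id2">id2</h2>
<p>other section</p>
</body></html>"""

print(section_text(SAMPLE, "tittle-id1"))  # ends at the next h2
print(section_text(SAMPLE, "tittle-id3"))  # an h3 section ended by an h2
print(section_text(SAMPLE, "tittle-id2"))  # last header: runs to the end
```

Comparing header levels as numbers covers both cases from the update: a sub-title that is followed by a higher-level header, and a header that is the last one on the page.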
Upvotes: 0
Views: 2285
Reputation: 14135
Here is the xpath to get all the elements between h2 tags.
//h2/following-sibling::*[count(following-sibling::h2)=1]
The same approach driven by the header id (change the id to check the different options in the sample below):
//*[@id='tittle-id1']/following::*[count(following-sibling::*[name()=name(preceding-sibling::*[@id='tittle-id1'])])=1]
Here is the sample HTML I used to simulate the scenario:
<html><head></head><body>
...
<h2 id="tittle-id1">id1</h2>
<h3 id="tittle-id3"> h3 tag</h3>
<h4 id="tittle-id4"> h4 tag</h4>
<h3 id="tittle-id5"> 2nd h3 tag</h3>
bunch of tags containing text I want to get
<h5 id="tittle-id6"> h5 tag </h5>
<h2 id="tittle-id2">id2</h2>
<h4 id="tittle-id7"> 2nd h4 tag</h4>
...
</body></html>
Note: This XPath is designed to suit the scenario in the original post.
Upvotes: 2
Reputation: 139
This is what worked for me.
Keep in mind that I'm using Scrapy with Python 2.7:
# Select every element with the same tag name as the header that has the given id.
name_query = u"//*[name()=name(//*[@id='" + self.id + u"'])]"
headers = response.xpath(name_query)
for selector in headers.getall():
    if self.id in selector:
        # Recover the header tag (h1..h6) and its text from the serialized element.
        balise = "h" + selector.split("<h")[1][0]
        title = selector.split(">")[1].split("<")[0]
        # Elements whose nearest preceding header is ours and that still have a following header.
        query = u"//*[preceding-sibling::" + balise + u"[1] ='" + title + u"' and following-sibling::" + balise + u"]"
        self.log('query = ' + query)
        results = response.xpath(query)
        results.pop(len(results) - 1)  # drop the last selected element
        with open(filename, 'wb') as f:
            for text in results.css("::text").getall():
                f.write(text.encode('utf-8') + "\n")
This should work in general: I tested it against multiple headers of different levels and it works fine for me.
Upvotes: 0
Reputation: 1882
Because predicates in XPath filter the context node list, you can't perform a join selection unless you are able to reintroduce target values from a relative context of your source values. For example, this selects all the elements with the same name as the one that has a specific id attribute:
//*[name()=name(//*[@id=$generated-id-string])]
Now, for the "between marks" problem, use the Kaysian method for intersection, as usual:
//*[name()=name(//*[@id=$generated-id-string])]/preceding-sibling::node()[
count(.|//*[@id=$generated-id-string]/following-sibling::node())
=
count(//*[@id=$generated-id-string]/following-sibling::node())
]
Test in http://www.xpathtester.com/xpath/0dcfdf59dccb8faf3705c22167ae45f1
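To see why the count(.|$b) = count($b) test amounts to an intersection, here is a small Python sketch that uses sets in place of XPath node-sets. The function name kaysian_intersection and the node labels are hypothetical, and the sketch simplifies the expression above by taking the preceding siblings of a single next header only. A node belongs to b exactly when adding it to b does not change the size of b.

```python
def kaysian_intersection(a, b):
    """Keep the members of a that are also in b, using the same test as the
    XPath predicate count(.|$b) = count($b)."""
    b = set(b)
    # The union {node} | b has the same size as b only if node is already in b.
    return [node for node in a if len({node} | b) == len(b)]


# Children of <body> in document order, as hypothetical labels.
nodes = ["h2#id1", "p1", "h3", "p2", "h2#id2", "p3"]
target = nodes.index("h2#id1")
next_header = nodes.index("h2#id2")

following_target = nodes[target + 1:]  # following-sibling::node() of the target
preceding_next = nodes[:next_header]   # preceding-sibling::node() of the next header

print(kaysian_intersection(preceding_next, following_target))
```

Only the nodes that are both after the target header and before the next header pass the test.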
Upvotes: 1