Reputation: 487
The divs below appear in that order in the HTML I am parsing.
//div[contains(@class,'top-container')]//font/text()
I'm using the xpath expression above to try to get any data in the first div below in which a hyphen is used to delimit the data:
Wednesday - Chess at Higgins Stadium
Thursday - Cook-off
The problem is I am getting data from the second div below such as:
Monday 10:00 - 11:00
Tuesday 10:00 - 11:00
How do I retrieve only the data from the first div? I also want to exclude any elements in the first div that do not contain this hyphenated data.
<div class="top-container">
<div dir="ltr">
<div dir="ltr"><font face="Arial" color="#000000" size="2">Wednesday - Chess at Higgins Stadium</font></div>
<div dir="ltr"><font face="Arial" size="2">Thursday - Cook-off</font></div>
<div dir="ltr"><font face="Arial" size="2"></font> </div>
<div dir="ltr"> </div>
<div dir="ltr"><font face="Arial" color="#000000" size="2"></font> </div>
</div>
<div dir="ltr">
<div RE><font face="Arial">
<div dir="ltr">
<div RE><font face="Arial" size="2"><strong>Alex Dawkin </strong></font></div>
<div RE><font face="Arial" size="2">Monday 10:00 - 11:00 </font></div>
<div RE><font size="2">Tuesday 10:00 - 11:00 </font></div>
<div RE>
<div RE><font face="Arial" size="2"></font></div><font face="Arial" size="2"></font></div>
<div RE> </div>
<div RE> </div>
Upvotes: 3
Views: 1460
Reputation: 66714
Your XPath was matching any font element that is a descendant of <div class="top-container">.
div[1] will address the first div child element of the "top-container" element. If you add that to your XPath, it will return the desired results.
//div[contains(concat(' ',@class,' '),' top-container ')]/div[1]//font/text()
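The expression above also swaps contains(@class,'top-container') for the concat(' ',@class,' ') idiom. As a rough illustration of why, here is a small Python sketch that mimics both tests on plain strings. The helper names substring_match and token_match are made up for this example and are not part of any XPath library.

```python
def substring_match(class_attr, name):
    # Mirrors contains(@class, name): true for any substring hit.
    return name in class_attr


def token_match(class_attr, name):
    # Mirrors contains(concat(' ', @class, ' '), concat(' ', name, ' ')).
    # Assumes class names are separated by single spaces.
    return f" {name} " in f" {class_attr} "


for value in ["top-container", "wide top-container", "top-container-inner"]:
    print(value, substring_match(value, "top-container"),
          token_match(value, "top-container"))
```

The substring test also accepts "top-container-inner", while the padded test only accepts a whole class token.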
If you want to ensure that only text() nodes that contain "-" are addressed, then you should also add a predicate filter to the text().
//div[contains(concat(' ',@class,' '),' top-container ')]/div[1]//font/text()[contains(.,'-')]
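For anyone who wants to try the idea without an XPath 1.0 engine, here is a minimal Python sketch using only the standard library. Python's xml.etree.ElementTree supports just a subset of XPath (no text(), no contains()), so the path selects the font elements and the hyphen filter is applied in Python instead. SAMPLE is a hypothetical, well-formed reduction of the markup in the question, and the class test is an exact match rather than the token test used above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed reduction of the HTML from the question.
SAMPLE = """
<html><body>
<div class="top-container">
  <div dir="ltr">
    <div dir="ltr"><font size="2">Wednesday - Chess at Higgins Stadium</font></div>
    <div dir="ltr"><font size="2">Thursday - Cook-off</font></div>
    <div dir="ltr"><font size="2"></font> </div>
  </div>
  <div dir="ltr">
    <div><font size="2"><strong>Alex Dawkin </strong></font></div>
    <div><font size="2">Monday 10:00 - 11:00 </font></div>
    <div><font size="2">Tuesday 10:00 - 11:00 </font></div>
  </div>
</div>
</body></html>
"""


def text_nodes(elem):
    """Yield the direct text children of elem, roughly like elem/text()."""
    if elem.text:
        yield elem.text
    for child in elem:
        if child.tail:
            yield child.tail


root = ET.fromstring(SAMPLE)

# div[1] restricts the search to the first child div of the container.
fonts = root.findall(".//div[@class='top-container']/div[1]//font")

# Python stand-in for the [contains(., '-')] predicate.
hyphenated = [t for f in fonts for t in text_nodes(f) if "-" in t]
print(hyphenated)
```

The schedule lines from the second div never appear because the path stops at div[1] before descending to the font elements.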
Instead of checking only for nodes that contain "-", how would you modify the last expression to just check for non-empty strings?
If you want to return any text() node with a value, then the predicate filter on text() is not necessary. If a text node doesn't have content, then it isn't a text node and won't be selected.
However, if you only want to select text() nodes that contain text other than whitespace, you could use this expression:
//div[contains(concat(' ',@class,' '),' top-container ')]/div[1]//font/text()[normalize-space()]
normalize-space() removes any leading and trailing whitespace characters and collapses internal runs of whitespace into a single space. So, if the text() only contained whitespace (including
), the result would be an empty string, which evaluates to false() in the predicate filter, so only text() containing something other than whitespace will be selected.
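A small Python sketch of that predicate, again with only the standard library. The normalize_space helper below is a hand-written approximation of the XPath 1.0 function, not a library call, and SAMPLE is a hypothetical fragment. XPath 1.0 counts only space, tab, carriage return and line feed as whitespace, so a non-breaking space (U+00A0) would not be stripped.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical fragment: the root element stands in for the top-container div.
SAMPLE = """<div class="top-container">
  <div dir="ltr">
    <div dir="ltr"><font size="2">Wednesday - Chess at Higgins Stadium</font></div>
    <div dir="ltr"><font size="2">  Thursday - Cook-off
    </font></div>
    <div dir="ltr"><font size="2">   </font> </div>
    <div dir="ltr"><font size="2"></font></div>
  </div>
</div>"""


def normalize_space(s):
    # Approximates XPath 1.0 normalize-space(): collapse runs of
    # space/tab/CR/LF to one space, then trim the ends.
    return re.sub(r"[ \t\r\n]+", " ", s).strip(" ")


root = ET.fromstring(SAMPLE)
fonts = root.findall("./div[1]//font")

# Keep a text value only if something is left after normalization;
# an empty string is falsy, like an empty string in an XPath predicate.
non_blank = [f.text for f in fonts if f.text and normalize_space(f.text)]
print([normalize_space(t) for t in non_blank])
```

The predicate only filters: the selected text keeps its original surrounding whitespace, which is why the sketch normalizes again before printing.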
Upvotes: 1