August

Reputation: 487

How to define an xpath expression that only retrieves hyphenated elements from the first of two similar divs?

The divs below appear in that order in the HTML I am parsing.

//div[contains(@class,'top-container')]//font/text()

I'm using the xpath expression above to try to get any data in the first div below in which a hyphen is used to delimit the data:

Wednesday - Chess at Higgins Stadium
Thursday - Cook-off

The problem is I am getting data from the second div below such as:

Monday 10:00 - 11:00
Tuesday 10:00 - 11:00

How do I retrieve only the data from the first div? (I also want to exclude any elements in the first div that do not contain this hyphenated data.)

<div class="top-container"> 
<div dir="ltr"> 
<div dir="ltr"><font face="Arial" color="#000000" size="2">Wednesday - Chess at Higgins Stadium</font></div> 
<div dir="ltr"><font face="Arial" size="2">Thursday - Cook-off</font></div> 
<div dir="ltr"><font face="Arial" size="2"></font>&nbsp;</div> 
<div dir="ltr">&nbsp;</div> 
<div dir="ltr"><font face="Arial" color="#000000" size="2"></font>&nbsp;</div>
</div> 

<div dir="ltr"> 
<div RE><font face="Arial"> 
<div dir="ltr"> 
<div RE><font face="Arial" size="2"><strong>Alex Dawkin </strong></font></div> 
<div RE><font face="Arial" size="2">Monday 10:00 - 11:00 </font></div> 
<div RE><font size="2">Tuesday 10:00 - 11:00 </font></div> 
<div RE> 
<div RE><font face="Arial" size="2"></font></div><font face="Arial" size="2"></font></div> 
<div RE>&nbsp;</div> 
<div RE>&nbsp;</div> 

Upvotes: 3

Views: 1460

Answers (1)

Mads Hansen

Reputation: 66714

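Your XPath matches every font element that is a descendant of <div class="top-container">, including the ones inside the second inner div.

div[1] addresses the first div child element of the "top-container" element. If you add that step to your XPath, it returns only the data from the first div. (The concat() wrapper used below pads the class value with spaces so that top-container is matched as a whole class name rather than as a substring of some other class.)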

//div[contains(concat(' ',@class,' '),' top-container ')]/div[1]//font/text()
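To see the effect of the div[1] step, here is a minimal sketch in Python. It assumes the third-party lxml library is installed, and SAMPLE is a simplified, hypothetical version of the markup in the question, not the real page.

```python
from lxml import html  # third-party; assumed installed (pip install lxml)

# Simplified, hypothetical version of the markup from the question.
SAMPLE = """
<div class="top-container">
  <div dir="ltr">
    <div dir="ltr"><font size="2">Wednesday - Chess at Higgins Stadium</font></div>
    <div dir="ltr"><font size="2">Thursday - Cook-off</font></div>
    <div dir="ltr"><font size="2"></font>&nbsp;</div>
  </div>
  <div dir="ltr">
    <div><font size="2">Monday 10:00 - 11:00 </font></div>
    <div><font size="2">Tuesday 10:00 - 11:00 </font></div>
  </div>
</div>
"""

tree = html.document_fromstring(SAMPLE)

# Shared prefix: the div whose class list contains the token "top-container".
TOP = "//div[contains(concat(' ',@class,' '),' top-container ')]"

# Without div[1]: text of every font descendant, from both inner divs.
all_fonts = tree.xpath(TOP + "//font/text()")

# With div[1]: only font descendants of the first child div.
first_div = tree.xpath(TOP + "/div[1]//font/text()")

print(all_fonts)
print(first_div)
```

The empty font element contributes nothing to either list, because an element with no content has no text node child.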

If you want to ensure that only text() nodes that contain "-" are addressed, then you should also add a predicate filter to the text().

//div[contains(concat(' ',@class,' '),' top-container ')]/div[1]//font/text()[contains(.,'-')]
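A small sketch of the hyphen filter, again assuming lxml is available. The "Events this week" line is a made-up entry added only to show a text node that the predicate rejects.

```python
from lxml import html  # third-party; assumed installed (pip install lxml)

# Hypothetical markup: the first inner div now also holds a line with no hyphen.
SAMPLE = """
<div class="top-container">
  <div dir="ltr">
    <div><font size="2">Events this week</font></div>
    <div><font size="2">Wednesday - Chess at Higgins Stadium</font></div>
    <div><font size="2">Thursday - Cook-off</font></div>
  </div>
  <div dir="ltr">
    <div><font size="2">Monday 10:00 - 11:00 </font></div>
  </div>
</div>
"""

tree = html.document_fromstring(SAMPLE)

# The predicate on text() keeps only text nodes that contain a hyphen.
hyphenated = tree.xpath(
    "//div[contains(concat(' ',@class,' '),' top-container ')]"
    "/div[1]//font/text()[contains(.,'-')]"
)

print(hyphenated)
```

If only the spaced delimiter should count, contains(., ' - ') would be a stricter variant of the same predicate.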

Instead of checking only for nodes that contain "-", how would you modify the last expression to just check for non-empty strings?

If you want to return any text() node with a value, then the predicate filter on text() is not necessary. An element with no content, such as <font></font>, has no text node child at all, so there is nothing for text() to select.

However, if you only want to select text() nodes that contain text other than whitespace, you could use this expression:

//div[contains(concat(' ',@class,' '),' top-container ')]/div[1]//font/text()[normalize-space()]

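normalize-space() strips leading and trailing whitespace and collapses internal runs of whitespace into a single space. If the text() node contains only whitespace, the result is an empty string, which evaluates to false() in the predicate filter, so only text() nodes containing something other than whitespace are selected. One caveat: XPath 1.0 counts only space, tab, carriage return, and line feed as whitespace, so a non-breaking space (&nbsp;) is not stripped. If those nodes should be excluded too, convert the non-breaking space to a regular space with translate() before normalizing.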
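The sketch below, assuming lxml and a hypothetical snippet with a whitespace-only font and a font holding only &nbsp;, compares plain normalize-space() with a variant that first maps U+00A0 to a regular space. The $nbsp variable is just an illustrative name passed in through lxml's XPath variable support.

```python
from lxml import html  # third-party; assumed installed (pip install lxml)

# Hypothetical markup: one real entry, one whitespace-only font, one &nbsp;-only font.
SAMPLE = """
<div class="top-container">
  <div dir="ltr">
    <div><font size="2">Wednesday - Chess at Higgins Stadium</font></div>
    <div><font size="2">   </font></div>
    <div><font size="2">&nbsp;</font></div>
  </div>
</div>
"""

tree = html.document_fromstring(SAMPLE)

BASE = (
    "//div[contains(concat(' ',@class,' '),' top-container ')]"
    "/div[1]//font/text()"
)

# Plain normalize-space(): drops the spaces-only node, keeps the U+00A0 node.
non_blank = tree.xpath(BASE + "[normalize-space()]")

# Map U+00A0 to a regular space first, so the &nbsp;-only node is dropped too.
non_blank_strict = tree.xpath(
    BASE + "[normalize-space(translate(., $nbsp, ' '))]",
    nbsp="\u00a0",
)

print(non_blank)
print(non_blank_strict)
```

Passing the character as a variable avoids embedding an invisible non-breaking space in the expression string.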

Upvotes: 1
