Isius

Reputation: 6974

How do I retrieve node data from multiple rows of an HTML table with XPath?

Sometime during the dark ages, a script was built that outputs the following HTML:

...
<TABLE BORDER=0 FRAME=ALL_FRAMES RULES=ALL_RULES ALIGN=CENTER BGCOLOR="ffffe5">
<CAPTION ALIGN=TOP>
<FONT  COLOR=009594 SIZE=-1><B>Access Information</B></FONT>
</CAPTION>
<TR>
<TD ALIGN=RIGHT VALIGN=MIDDLE>
<FONT  COLOR=black SIZE=-1><B>Access Circuit(s):</B></FONT>
</TD>
<TD ALIGN=LEFT VALIGN=MIDDLE>
**DATA TO COLLECT 111**
</TD>
<TD ALIGN=RIGHT VALIGN=MIDDLE>
<FONT  COLOR=black SIZE=-1><B>Other Circuit(s):</B></FONT>
</TD>
<TD ALIGN=LEFT VALIGN=MIDDLE>
&nbsp
</TD>
</TR>
<TR>
<TD ALIGN=RIGHT VALIGN=MIDDLE>
&nbsp
</TD>
<TD ALIGN=LEFT VALIGN=MIDDLE>
**DATA TO COLLECT AAA**
</TD>
<TD ALIGN=RIGHT VALIGN=MIDDLE>
&nbsp
</TD>
<TD ALIGN=LEFT VALIGN=MIDDLE>
&nbsp
</TD>
</TR>
<TR>
<TD ALIGN=RIGHT VALIGN=MIDDLE>
&nbsp
</TD>
<TD ALIGN=LEFT VALIGN=MIDDLE>
**DATA TO COLLECT BBB**
</TD>
<TD ALIGN=RIGHT VALIGN=MIDDLE>
&nbsp
</TD>
<TD ALIGN=LEFT VALIGN=MIDDLE>
&nbsp
</TD>
</TR>
<TR>
<TD ALIGN=RIGHT VALIGN=MIDDLE>
&nbsp
</TD>
<TD ALIGN=LEFT VALIGN=MIDDLE>
**DATA TO COLLECT CCC**
</TD>
<TD ALIGN=RIGHT VALIGN=MIDDLE>
&nbsp
</TD>
<TD ALIGN=LEFT VALIGN=MIDDLE>
&nbsp
</TD>
</TR>
<TR>
<TD ALIGN=RIGHT VALIGN=MIDDLE>
<FONT  COLOR=black SIZE=-1><B>Customer:</B></FONT>
</TD>
...

Sorry, I would show you the table layout but I don't know how without <table> on SO

How can I use XPath (in PHP) to collect only the DATA TO COLLECT cells? So far I've only been able to retrieve the one in the first row, with //*[*='Access Circuit(s):']/following-sibling::td[1].

Things to note:

Upvotes: 1

Views: 631

Answers (1)

Petr Janeček

Reputation: 38424

The expression I came up with is this:

//TR[(.//B[.='Access Circuit(s):']) or ((./preceding-sibling::TR//B[.='Access Circuit(s):']) and (./following-sibling::TR//B[.='Customer:']))]//TD[2]

returns

<TD ALIGN="LEFT" VALIGN="MIDDLE">**DATA TO COLLECT 111**</TD>
<TD ALIGN="LEFT" VALIGN="MIDDLE">**DATA TO COLLECT AAA**</TD>
<TD ALIGN="LEFT" VALIGN="MIDDLE">**DATA TO COLLECT BBB**</TD>
<TD ALIGN="LEFT" VALIGN="MIDDLE">**DATA TO COLLECT CCC**</TD>

It relies on knowing that the first row contains Access Circuit(s): and that the first row you don't want to collect contains Customer:. If you can't be sure of both of those, then I think it can't be done with a single XPath expression.
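If the Customer: row can't be relied on, one possible fallback is to do the walking in PHP instead of in XPath. The sketch below is only an illustration under some assumptions that are not in the question: the page is parsed with DOMDocument::loadHTML() (which lowercases element names, hence tr/td/b), continuation rows never have a bold label in their first cell, and $html is a shortened, made-up stand-in for the real page. It still needs the Access Circuit(s): label to find the starting row.

```php
<?php
// Shortened, hypothetical stand-in for the table in the question.
$html = <<<'HTML'
<TABLE>
<TR><TD><B>Access Circuit(s):</B></TD><TD>
DATA 111
</TD><TD><B>Other Circuit(s):</B></TD><TD>&nbsp;</TD></TR>
<TR><TD>&nbsp;</TD><TD>
DATA AAA
</TD><TD>&nbsp;</TD><TD>&nbsp;</TD></TR>
<TR><TD><B>Customer:</B></TD><TD>ACME</TD></TR>
</TABLE>
HTML;

libxml_use_internal_errors(true);   // legacy markup triggers parser warnings
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Row holding the label; element names are lowercase after loadHTML().
$start = $xpath->query("//tr[.//b[.='Access Circuit(s):']]")->item(0);

$values = [];
for ($row = $start; $row !== null; $row = $row->nextSibling) {
    // Skip whitespace text nodes and anything that is not a table row.
    if (!($row instanceof DOMElement) || $row->nodeName !== 'tr') {
        continue;
    }
    // Assumption: a bold label in the first cell marks the next section.
    if (!$row->isSameNode($start) && $xpath->query('td[1]//b', $row)->length > 0) {
        break;
    }
    $cell = $xpath->query('td[2]', $row)->item(0);
    if ($cell !== null) {
        $values[] = trim($cell->textContent);
    }
}

print_r($values);
```

The loop stops at the first later row whose first cell carries a label, so it doesn't matter whether that label is Customer: or something else.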

Step-by-step
1. //TR[
2.     (.//B[.="Access Circuit(s):"])
3.     or (     (./preceding-sibling::TR//B[.="Access Circuit(s):"])
4.          and (./following-sibling::TR//B[.="Customer:"]) )
5.     ]//TD[2]

Means
1. all TR nodes
2. that either contain "Access Circuit(s):"
3. or are positioned after "Access Circuit(s):"
4. and are positioned before "Customer:"
5. and, from those rows, all TD nodes that are the second TD of their parent
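For running the expression from PHP, something like the following might work. It's a minimal sketch, assuming the page is parsed with DOMDocument::loadHTML(). That parser lowercases element names, so the expression is written with tr, td and b rather than TR, TD and B. The $html sample is a shortened, made-up stand-in for the real page.

```php
<?php
// Shortened, hypothetical stand-in for the table in the question.
$html = <<<'HTML'
<TABLE>
<TR><TD><B>Access Circuit(s):</B></TD><TD>
DATA 111
</TD><TD><B>Other Circuit(s):</B></TD><TD>&nbsp;</TD></TR>
<TR><TD>&nbsp;</TD><TD>
DATA AAA
</TD><TD>&nbsp;</TD><TD>&nbsp;</TD></TR>
<TR><TD><B>Customer:</B></TD><TD>ACME</TD></TR>
</TABLE>
HTML;

libxml_use_internal_errors(true);   // legacy markup triggers parser warnings
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Same expression as above, with lowercase element names.
$query = "//tr[(.//b[.='Access Circuit(s):'])"
       . " or ((./preceding-sibling::tr//b[.='Access Circuit(s):'])"
       . " and (./following-sibling::tr//b[.='Customer:']))]//td[2]";

$values = [];
foreach ($xpath->query($query) as $td) {
    // The cells contain surrounding newlines, so trim them.
    $values[] = trim($td->textContent);
}

print_r($values);
```

With this sample, $values should contain DATA 111 and DATA AAA; the Customer: row is skipped because no later row contains Customer:.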

Upvotes: 1
