Reputation: 460380
Since I'm still not that familiar with XPath, I prefer using LINQ with HtmlAgilityPack. I think this is one of the cases where I need an XPath solution, so I need your help.
Consider this simplified HTML snippet:
<td><b>Billing informations:</b>
    <table>
        <tr>
            <td style="color: #757575; padding-left: 10px; padding-bottom: 20px;">
                Invoice-Number:1534753<br />Transactioncode: 1WF772582A4041717
            </td>
        </tr>
    </table>
</td>
This is part of a larger HTML page, but it demonstrates the issue I have. I need to extract the Invoice-Number and TransactionCode. Sometimes the text is in a span and sometimes it is directly in the cell, like here, so I need a way that works in both cases.
I've tried this:
var invoiceCell = doc.DocumentNode.Descendants("td")
    .FirstOrDefault(cell => cell.InnerText.Contains("Invoice-Number"));
if (invoiceCell != null)
{
    string text = invoiceCell.InnerText;
    // use string methods to extract both values
}
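The problem is that invoiceCell.InnerText returns the outermost cell's InnerText, not that of the cell that directly contains the Invoice-Number. Descendants enumerates nodes in document order and InnerText includes the text of all nested nodes, so the outer cell is the first one that matches. As a result, text also contains "Billing informations":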
Billing informations:
Invoice-Number:1534753Transactioncode: 1WF772582A4041717
While I could use string methods or a regex to extract both values in this case, that is very error-prone because the larger HTML page contains many nested tables. I just want the InnerText of the innermost cell. If there is also a LINQ solution to this issue, I'd prefer that.
Update: I've noticed that using LastOrDefault instead of FirstOrDefault might be a viable workaround, because it seems to return the innermost cell that matches the condition:
var invoiceCell = doc.DocumentNode.Descendants("td")
    .LastOrDefault(cell => cell.InnerText.Contains("Invoice-Number"));
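LastOrDefault relies on the marker text appearing only once on the page, since it simply returns the last matching cell in document order. A variant that does not depend on match order could state the "innermost" condition explicitly: the cell contains the text, but none of its nested cells do. The following is only a minimal sketch under assumptions: the inline HTML string is a stand-in for the real page, and it uses C# top-level statements (C# 9 or later) with the HtmlAgilityPack NuGet package.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Assumed sample markup, modelled on the simplified snippet from the question.
var html = @"<table><tr><td><b>Billing informations:</b>
<table><tr>
<td style=""color: #757575;"">Invoice-Number:1534753<br />Transactioncode: 1WF772582A4041717</td>
</tr></table>
</td></tr></table>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// A cell counts as innermost when it contains the marker text
// but none of the cells nested inside it do.
var invoiceCell = doc.DocumentNode.Descendants("td")
    .FirstOrDefault(cell =>
        cell.InnerText.Contains("Invoice-Number") &&
        !cell.Descendants("td").Any(inner => inner.InnerText.Contains("Invoice-Number")));

if (invoiceCell != null)
{
    // InnerText of the inner cell only, without "Billing informations:".
    Console.WriteLine(invoiceCell.InnerText.Trim());
}
```

Because the condition is checked per cell, replacing FirstOrDefault with Where would return every innermost match if the page contains several invoice blocks.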
Upvotes: 4
Views: 402
Reputation: 89335
Here is another alternative using XPath that covers both cases: when the target text is directly inside the cell, and when it is wrapped in a span:
var xpath = "//td[contains(text(),'Invoice-Number') or contains(span,'Invoice-Number')]";
var invoiceCell = doc.DocumentNode.SelectSingleNode(xpath);
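One caveat worth knowing: in XPath 1.0, contains(text(), ...) only inspects the first text node of the cell, and contains(span, ...) only the first child span. That is enough for the sample, where the marker sits in the first text node, but a predicate on each node is a bit more tolerant. The sketch below shows that variant and then pulls out both values with regular expressions. The inline HTML and the two regex patterns are assumptions based on the sample labels, not something confirmed for the real page, and the code uses C# top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

// Assumed sample markup, modelled on the simplified snippet from the question.
var html = @"<table><tr><td><b>Billing informations:</b>
<table><tr>
<td>Invoice-Number:1534753<br />Transactioncode: 1WF772582A4041717</td>
</tr></table>
</td></tr></table>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// Test every direct text node of the cell and every child span,
// instead of only the first one of each.
var xpath = "//td[text()[contains(., 'Invoice-Number')] or span[contains(., 'Invoice-Number')]]";
var invoiceCell = doc.DocumentNode.SelectSingleNode(xpath);

if (invoiceCell != null)
{
    // DeEntitize turns entities such as &amp; back into plain characters.
    string text = HtmlEntity.DeEntitize(invoiceCell.InnerText);

    // Hypothetical patterns derived from the sample labels; adjust to the real page.
    Match invoice = Regex.Match(text, @"Invoice-Number:\s*(\d+)");
    Match transaction = Regex.Match(text, @"Transactioncode:\s*([A-Z0-9]+)");

    if (invoice.Success) Console.WriteLine(invoice.Groups[1].Value);
    if (transaction.Success) Console.WriteLine(transaction.Groups[1].Value);
}
```

The outer cell is not selected here because its own text nodes contain only whitespace; "Billing informations:" lives inside the b element and the invoice text inside the nested cell.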
Upvotes: 1