Tim Schmelter

Reputation: 460380

Get only InnerText of this node excluding children

Since I'm still not that familiar with XPath, I prefer LINQ with HtmlAgilityPack. I think this is one of the cases where I need an XPath solution, so I need your help.

Consider this simplified HTML snippet:

<td><b>Billing informations:</b>
    <table>
        <tr>
            <td style="color: #757575; padding-left: 10px; padding-bottom: 20px;">
                Invoice-Number:1534753<br />Transactioncode: 1WF772582A4041717
            </td>
        </tr>
    </table>
</td>

This is part of a larger HTML page, but it demonstrates the issue I have. I need to extract the Invoice-Number and the Transactioncode. Sometimes the text is in a span and sometimes directly in the cell, like here, so I need an approach that works in both cases.

I've tried this:

var invoiceCell = doc.DocumentNode.Descendants("td")
    .FirstOrDefault(cell => cell.InnerText.Contains("Invoice-Number"));
if (invoiceCell != null)
{
    string text = invoiceCell.InnerText;
    // use string methods to extract both values
}

The problem is that FirstOrDefault returns the outermost matching cell, not the cell that directly contains the Invoice-Number, because the InnerText of an outer cell includes the text of all its descendants. So text also contains "Billing informations":

Billing informations:



                Invoice-Number:1534753Transactioncode: 1WF772582A4041717

While I could use string methods or a regex to extract both values in this case, that is very error-prone since the larger HTML page contains many nested tables. I just want the InnerText of the innermost cell. If there is also a LINQ solution to this issue, I'd prefer that.

Update: I've noticed that using LastOrDefault instead of FirstOrDefault might be a viable workaround, because that seems to return the innermost cell that matches the condition:

var invoiceCell = doc.DocumentNode.Descendants("td")
    .LastOrDefault(cell => cell.InnerText.Contains("Invoice-Number"));
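LastOrDefault works here because Descendants enumerates nodes in document order, so a nested cell comes after the cells that contain it. It would pick the wrong cell if a later, unrelated cell on the page also contained the text. A LINQ-only variant that states the "innermost" condition explicitly might look like the sketch below. It assumes the HtmlAgilityPack NuGet package is referenced; the sample HTML is the snippet from above wrapped in a table, and the class name is only illustrative.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class InnermostCellSketch
{
    static void Main()
    {
        // Sample markup from the question, wrapped in <table><tr> for completeness.
        const string html = @"<table><tr><td><b>Billing informations:</b>
    <table>
        <tr>
            <td style=""color: #757575; padding-left: 10px; padding-bottom: 20px;"">
                Invoice-Number:1534753<br />Transactioncode: 1WF772582A4041717
            </td>
        </tr>
    </table>
</td></tr></table>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        const string marker = "Invoice-Number";

        // A cell is "innermost" if it contains the marker text
        // and none of its nested td elements contain it as well.
        var invoiceCell = doc.DocumentNode.Descendants("td")
            .FirstOrDefault(cell => cell.InnerText.Contains(marker)
                && !cell.Descendants("td").Any(inner => inner.InnerText.Contains(marker)));

        if (invoiceCell != null)
        {
            // InnerText of the inner cell no longer includes "Billing informations:".
            Console.WriteLine(invoiceCell.InnerText.Trim());
        }
    }
}
```

The nested Descendants call rescans each candidate's subtree, which should be fine for a single page but is not tuned for very large documents.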

Upvotes: 4

Views: 402

Answers (1)

har07

Reputation: 89335

Here is another alternative, using XPath, that covers both cases: when the target text is directly inside the cell and when it is wrapped in a span:

var xpath = "//td[contains(text(),'Invoice-Number') or contains(span,'Invoice-Number')]";
var invoiceCell = doc.DocumentNode.SelectSingleNode(xpath);
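For completeness, here is a minimal sketch of how the matched cell could then be turned into the two values. It assumes the HtmlAgilityPack NuGet package is referenced and that every value appears as "label: value" in its own text node, as in the question's snippet; the class name and the dictionary approach are illustrative, not the only way to do it.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class InvoiceXPathSketch
{
    static void Main()
    {
        // Sample markup from the question, wrapped in <table><tr> for completeness.
        const string html = @"<table><tr><td><b>Billing informations:</b>
    <table>
        <tr>
            <td style=""color: #757575; padding-left: 10px; padding-bottom: 20px;"">
                Invoice-Number:1534753<br />Transactioncode: 1WF772582A4041717
            </td>
        </tr>
    </table>
</td></tr></table>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var xpath = "//td[contains(text(),'Invoice-Number') or contains(span,'Invoice-Number')]";
        var invoiceCell = doc.DocumentNode.SelectSingleNode(xpath);

        if (invoiceCell == null)
        {
            Console.WriteLine("No matching cell found.");
            return;
        }

        // Collect the text nodes below the cell (directly inside it or inside a span),
        // then split each "label: value" pair at the first colon.
        // Note: ToDictionary throws if the same label occurs twice.
        var values = invoiceCell.Descendants()
            .Where(node => node.NodeType == HtmlNodeType.Text)
            .Select(node => node.InnerText.Trim())
            .Where(text => text.Contains(":"))
            .Select(text => text.Split(new[] { ':' }, 2))
            .ToDictionary(parts => parts[0].Trim(), parts => parts[1].Trim());

        foreach (var pair in values)
        {
            Console.WriteLine(pair.Key + " = " + pair.Value);
        }
    }
}
```

One caveat: in XPath 1.0, contains(text(), '...') only inspects the first text node child of each td. If the label may appear after another element inside the cell, a predicate such as text()[contains(., 'Invoice-Number')] checks every text node child instead.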

Upvotes: 1
