Photonic

Reputation: 1421

C#: What is the #text node in HtmlNode?

I am trying to walk through each HTML node and read its attributes and InnerText. At the moment, whenever I scan any HTML, I get this stupid #text node even though it doesn't exist.

Here is my HTML:

<div class="demographic-info adr editable-item" id="demographics">
  <div id="location-container" data-li-template="location">
    <div id="location" class="editable-item">
      <dl>
        <dt>Location</dt>
        <dd>
          <span class="locality">Bolton, United Kingdom</span>
        </dd>
        <dt>Industry</dt>
        <dd class="industry">Computer Games</dd>
      </dl>
    </div>
  </div>
</div>

And here is my C#:

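// j is the parent HtmlNode being scanned (defined elsewhere in my code)
foreach (HtmlNode node in j.ChildNodes)
{
    if (node.HasChildNodes)
        checkNode(node);
}

// Recursively visits every descendant; leaf nodes are handed to hasValueInNode
static void checkNode(HtmlNode node)
{
    foreach (HtmlNode n in node.ChildNodes)
    {
        if (n.HasChildNodes)
            checkNode(n);
        else
        {
            HtmlNode nodeValue = hasValueInNode(n);
            if (nodeValue != null)
                addCategories(nodeValue);
        }
    }
}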

When I step through in the debugger to see which node the code is on, I get this:

1 = div, 2 = #text, 3 = div, 4 = #text, 5 = div, 6 = #text, 7 = dl ... and so on!

I am guessing that it is detecting blank spaces or line breaks as nodes, but this is such a waste of loop iterations. Can someone explain this to me and suggest a way to avoid it? Thanks.

Upvotes: 0

Views: 804

Answers (1)

Sami Kuhmonen

Reputation: 31143

This is how HTML/XML works. There is a text node every time there is some text inside a node. In this case it happens to be whitespace, but it is still text and it cannot be discarded. The node is not "stupid" and it does exist.

Your code is free to check whether a text node contains only whitespace and ignore it if you want to, or you can write the markup so that there isn't any whitespace between the tags.
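A minimal sketch of the first option, assuming the HtmlNode in the question comes from HtmlAgilityPack. The class name TextNodeDemo and the helper ElementNames are hypothetical and only for illustration; the idea is to test NodeType and skip text nodes whose InnerText is nothing but whitespace:

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

static class TextNodeDemo
{
    // Hypothetical helper: returns the names of all element nodes under
    // 'node' in document order, ignoring whitespace-only text nodes.
    public static List<string> ElementNames(HtmlNode node)
    {
        var names = new List<string>();
        foreach (HtmlNode child in node.ChildNodes)
        {
            // A #text node that holds only spaces or line breaks is skipped.
            if (child.NodeType == HtmlNodeType.Text &&
                string.IsNullOrWhiteSpace(child.InnerText))
            {
                continue;
            }

            if (child.NodeType == HtmlNodeType.Element)
                names.Add(child.Name);

            // Recurse; text nodes have no children, so this is a no-op for them.
            names.AddRange(ElementNames(child));
        }
        return names;
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<dl>\n  <dt>Location</dt>\n  <dd class=\"industry\">Computer Games</dd>\n</dl>");
        Console.WriteLine(string.Join(", ", ElementNames(doc.DocumentNode)));
    }
}
```

Text nodes that carry real content (such as "Location") are still visited here; only the whitespace-only ones are dropped, so nothing meaningful is lost.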

Just as a thought: how would you tell the parser which whitespace is important?

<div>
  <div>Test<span>
  </span>test</div>
</div>

So, should the parser just decide "there's 'Test', then an empty span element, and then 'test', so the text inside is actually 'Testtest'"? Or how would it know what to do?

Upvotes: 1
