TarunG

Reputation: 610

HTML to XHTML - skip some tags completely. (C# beginner)

I have started learning C# recently. MSDN has an example where you build an RSS application by fetching the XML file directly, so I tried something of my own and, like most times, I didn't get it right. Put the sigh sound here.

As the pages are HTML, I tried looking for HTML to XHTML converters, and I found this one (which is pretty interesting) called HTML-Cleaner.

It replaces unwanted tags with a <dd> tag, but I wish to skip those tags, so I made a modification of my own:

public override bool Read()
{
  bool status = base.Read();
  if( status )
  {
    if( base.NodeType == XmlNodeType.Element )
    {
      dowrite = false;
      // Accept the element if it is in AllowedTags, or if it is an
      // "html" element seen before any element has been accepted.
      foreach (string line in AllowedTags)
      {
        if (base.Name == line ||
           (base.Name == "html" && first == false))
        {
          dowrite = true;
          first = true;
        }
      }

      // Got a node with prefix. This must be one of those "<o:p>"
      // or something else.  Skip this node entirely. We want prefix-
      // less nodes so that the resultant XML requires no namespace.
      if( base.Name.IndexOf(':') > 0 )
        dowrite = false;

      if(!dowrite)
        base.Skip();
    }
  }
  return status;
}

The problem is it only prints one <dd> tag and nothing else. Even if allowed tags are present, it skips them.

Why is this happening? Any help will be greatly appreciated. If you have alternative approaches, please feel free to suggest them.


EDIT: Is there any way to achieve this?

Upvotes: 0

Views: 297

Answers (1)

Alohci

Reputation: 83125

It looks like the Read method returns XML nodes, not tags, so the entire contents of any non-matching node will be dropped.

If the input is a typical HTML file, at some point the Read method will reach the 'head' element. This is not in the AllowedTags list, so it and all its descendant nodes will be skipped.

The same applies to the body element. It and all its descendants will be skipped.
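To see the effect in isolation, here is a minimal sketch that uses a plain XmlReader rather than the HTML-Cleaner classes. The class name SkipDemo, the method VisitedElements, and the sample markup are made up for illustration. It calls Skip() on every element that is not in the allowed set and records which elements the reader actually reaches.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class SkipDemo
{
    // Hypothetical helper: returns the names of the elements the reader
    // visits when every element not in 'allowed' is passed to Skip().
    public static List<string> VisitedElements(string xml, ISet<string> allowed)
    {
        var visited = new List<string>();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            reader.Read();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && !allowed.Contains(reader.Name))
                {
                    // Skip() moves past the matching end tag, so every
                    // descendant of this element is discarded with it.
                    // The reader is already on the next node afterwards,
                    // so Read() must not be called again here.
                    reader.Skip();
                }
                else
                {
                    if (reader.NodeType == XmlNodeType.Element)
                        visited.Add(reader.Name);
                    reader.Read();
                }
            }
        }
        return visited;
    }

    public static void Main()
    {
        string xml = "<html><head><title>t</title></head><body><p>Hello</p></body></html>";
        var allowed = new HashSet<string> { "html", "p", "title" };
        // prints "html"
        Console.WriteLine(string.Join(",", VisitedElements(xml, allowed)));
    }
}
```

Even though 'p' and 'title' are allowed, the reader never gets to them, because their parents 'body' and 'head' were skipped as whole subtrees.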

That leaves the html element, which matches in your code and so gets inserted into the XML DOM.

Since html is not in the AllowedTags list, during the HTMLWriter phase, the html tags will get converted to dd tags, which is what you describe as your output.

I actually don't go a bundle on the html2xhtmlcleaner code, but I think you need to adapt the writer code rather than the reader code to achieve what you are trying to do.
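As a rough idea of what filtering on the writing side could look like, here is a sketch that does not use the HTML-Cleaner writer at all. It copies from a plain XmlReader to a plain XmlWriter and assumes two things: that the input is already well-formed XML, and that "skipping" a tag means dropping the tag itself while keeping its children. TagFilter and Unwrap are hypothetical names, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

public static class TagFilter
{
    // Hypothetical helper: copies 'xml' to a string, writing only the
    // elements named in 'allowed'. Disallowed elements lose their tags,
    // but their text and child elements are still processed.
    public static string Unwrap(string xml, ISet<string> allowed)
    {
        var output = new StringBuilder();
        var settings = new XmlWriterSettings
        {
            OmitXmlDeclaration = true,
            // Fragment level tolerates text at the top level if the
            // root element itself is not in the allowed set.
            ConformanceLevel = ConformanceLevel.Fragment
        };
        // One entry per open element: true if its start tag was written.
        var written = new Stack<bool>();

        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        bool keep = allowed.Contains(reader.Name);
                        bool isEmpty = reader.IsEmptyElement;
                        if (keep)
                        {
                            writer.WriteStartElement(reader.Name);
                            writer.WriteAttributes(reader, false);
                            if (isEmpty) writer.WriteEndElement();
                        }
                        // Empty elements never produce an EndElement node.
                        if (!isEmpty) written.Push(keep);
                        break;

                    case XmlNodeType.EndElement:
                        if (written.Pop()) writer.WriteEndElement();
                        break;

                    case XmlNodeType.Text:
                    case XmlNodeType.CDATA:
                        writer.WriteString(reader.Value);
                        break;
                }
            }
        }
        return output.ToString();
    }
}
```

With allowed tags "html" and "p", the input `<html><body><p>Hello <b>world</b></p></body></html>` comes out as `<html><p>Hello world</p></html>`. Note that text inside a dropped element (a 'title', for example) is kept too, so anything that should vanish completely would still need separate handling.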

Upvotes: 2
