Reputation: 610
I started learning C# recently. MSDN has an example where you build an RSS application by fetching the XML file directly, so I tried something of my own and, like most times, I didn't get it right. Put the sigh sound here.
As the pages are HTML, I looked for HTML to XHTML converters, and I found this one (which is pretty interesting) called HTML-Cleaner.
It replaces unwanted tags with a <dd> tag, but I wish to skip those tags instead, so I made a modification of my own:
public override bool Read()
{
    bool status = base.Read();
    if( status )
    {
        if( base.NodeType == XmlNodeType.Element )
        {
            dowrite = false;
            // Got a node with prefix. This must be one of those "<o:p>"
            // or something else. Skip this node entirely. We want
            // prefix-less nodes so that the resultant XML requires no
            // namespace.
            foreach (string line in AllowedTags)
            {
                if (base.Name == line ||
                    (base.Name == "html" && first == false))
                {
                    dowrite = true;
                    first = true;
                }
            }
            if( base.Name.IndexOf(':') > 0 )
                dowrite=false;
            if(!dowrite)
                base.Skip();
        }
    }
    return status;
}
The problem is that it only prints one <dd> tag and nothing else. Even if allowed tags are present, it skips them.
Why is this happening? Any help will be greatly appreciated. If you have alternative approaches, please feel free to suggest them.
EDIT: Is there any way to achieve this?
Upvotes: 0
Views: 297
Reputation: 83125
It looks like the Read method returns XML nodes, not tags, so calling Skip on a non-matching element drops that element's entire contents, child elements included.
If the input is a typical HTML file, at some point a Read call will reach the head element. This is not in the AllowedTags list, so it and all its descendant nodes will be skipped.
The same applies to the body element. It and all its descendants will be skipped.
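To see the effect in isolation, here is a minimal sketch that is independent of HTML-Cleaner. SkipDemo and VisitedElements are hypothetical names made up for this illustration, and the input is a tiny well-formed document rather than real HTML. It records every element the reader stops on while calling Skip on one element name:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

static class SkipDemo
{
    // Hypothetical helper: returns the names of the elements the reader
    // actually stops on when every element named skipName is passed over
    // with Skip().
    public static List<string> VisitedElements(string xml, string skipName)
    {
        var visited = new List<string>();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            reader.Read(); // move to the first node
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == skipName)
                {
                    // Skip() jumps past the matching end tag, so the reader
                    // is already on the next node: do not call Read() here.
                    reader.Skip();
                    continue;
                }
                if (reader.NodeType == XmlNodeType.Element)
                    visited.Add(reader.Name);
                reader.Read();
            }
        }
        return visited;
    }

    static void Main()
    {
        const string xml =
            "<html><head><title>t</title></head><body><p>text</p></body></html>";
        Console.WriteLine(string.Join(", ", VisitedElements(xml, "head")));
    }
}
```

With head skipped, the list should contain html, body and p; title never appears because it sits inside the skipped element. Note also that after Skip the reader is already positioned on the following node, which matters in a Read override: the caller receives that node without it having gone through the AllowedTags check.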
That leaves the html element, which matches in your code and so gets inserted into the XML DOM.
Since html is not in the AllowedTags list, during the HTMLWriter phase the html tags will get converted to dd tags, which is what you describe as your output.
I'm actually not that keen on the html2xhtmlcleaner code, but I think you need to adapt the writer code rather than the reader code to achieve what you are trying to do.
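As a rough illustration of filtering on the writing side, here is a sketch that does not use HTML-Cleaner's classes at all. TagFilter and KeepAllowed are hypothetical names, the input is assumed to be well-formed XML already, and attributes, comments and whitespace-only nodes are simply dropped. Every node is read, but start and end tags are only written for allowed elements, so the children of a disallowed element survive:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

static class TagFilter
{
    // Hypothetical helper: copies the document, writing tags only for
    // element names in 'allowed'. Text inside dropped elements is kept.
    public static string KeepAllowed(string xml, ISet<string> allowed)
    {
        var sb = new StringBuilder();
        var settings = new XmlWriterSettings
        {
            OmitXmlDeclaration = true,
            // Fragment mode tolerates several top-level nodes, which can
            // happen if the root element itself is not allowed.
            ConformanceLevel = ConformanceLevel.Fragment
        };
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        using (XmlWriter writer = XmlWriter.Create(sb, settings))
        {
            // For each open element, remembers whether its start tag was written.
            var written = new Stack<bool>();
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                    {
                        bool keep = allowed.Contains(reader.Name);
                        // <br/> style elements produce no EndElement node.
                        bool empty = reader.IsEmptyElement;
                        if (keep) writer.WriteStartElement(reader.Name);
                        if (empty) { if (keep) writer.WriteEndElement(); }
                        else written.Push(keep);
                        break;
                    }
                    case XmlNodeType.EndElement:
                        if (written.Pop()) writer.WriteFullEndElement();
                        break;
                    case XmlNodeType.Text:
                        writer.WriteString(reader.Value);
                        break;
                }
            }
        }
        return sb.ToString();
    }

    static void Main()
    {
        var allowed = new HashSet<string> { "html", "body", "p" };
        Console.WriteLine(KeepAllowed(
            "<html><body><div><p>hi</p></div></body></html>", allowed));
    }
}
```

Here the div wrapper is removed while the p inside it is kept, which is the behaviour the reader-side Skip cannot give you. Whether the same idea fits cleanly into HTMLWriter depends on how that class is written, so treat this as a starting point only.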
Upvotes: 2