Reputation: 38714
Nearly all browsers allow a certain amount of leeway when rendering invalid HTML. For example, they render x < y
as if it were written x &lt; y
because it is "clear" that the <
is intended as a literal character, not part of an HTML tag.
Where can I find that logic as a separate "cleanup" module? Such a module would convert x < y
to x &lt; y
Upvotes: 2
Views: 281
Reputation: 95548
Try looking at the source code for Tidy.
HTML before running through Tidy:
<html>
<head>
<title>boo</title>
</head>
<body>
x < y
</body>
</html>
Same HTML after running through Tidy:
<html>
<head>
<meta name="generator" content=
"HTML Tidy for Linux (vers 25 March 2009), see www.w3.org">
<title>boo</title>
</head>
<body>
x &lt; y
</body>
</html>
Notice that x < y
was changed to x &lt; y.
UPDATE
Based on your comment, you should probably use Tidy to clean up your HTML. I believe there are Tidy libraries for most common languages that will do the cleanup for you. If you are using PHP, there is PHP Tidy.
UPDATE
I noticed that you said you're using C#. You can use Tidy with C# as well. Here's something I found. I don't develop in C# and I haven't tried this out so YMMV:
Fix Up Your HTML with HTML Tidy and .NET
Upvotes: 3
Reputation: 943759
The HTML 5 (draft) specification includes a detailed parsing algorithm based on how browsers handle bad markup.
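To make the idea concrete, here is a minimal Python sketch (Python is my choice; the question is not tied to it). It uses the standard-library html.parser, which is lenient in a similar spirit but is not an implementation of the HTML5 algorithm; a third-party library such as html5lib would be closer to the spec. The names TextEscaper and escape_stray_lt are hypothetical, and the sketch silently drops comments and doctypes.

```python
from html import escape
from html.parser import HTMLParser


class TextEscaper(HTMLParser):
    """Re-emit the input, keeping tags and escaping text (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag as it appeared in the input
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, unchanged
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text between tags: a "<" that did not start a tag ends up here
        self.out.append(escape(data, quote=False))


def escape_stray_lt(markup):
    parser = TextEscaper()
    parser.feed(markup)
    parser.close()  # flush any buffered text
    return "".join(parser.out)
```

With this sketch, escape_stray_lt("<p>x < y</p>") should leave the p tags alone and escape only the stray < in the text.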
Upvotes: 0
Reputation: 14561
Edit: I am assuming you're using PHP, since you didn't specify.
Use strip_tags:
$content = strip_tags($content, '<b><i>');
This will leave the tags you allow (here b and i) and remove all others.
Upvotes: -1
Reputation: 23794
Rendering of invalid HTML in browsers is horrible guesswork, and you really shouldn't try to emulate it (it will break). However, replacing some occurrences could be done with a regexp:
$data = preg_replace('/(\s)<(\s)/', '$1&lt;$2', $data);
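For readers not on PHP, the same substitution (a < with whitespace on both sides becomes &lt;) can be sketched in Python with the standard re module. The function name escape_spaced_lt is hypothetical.

```python
import re


def escape_spaced_lt(data):
    # Replace "<" only when it has whitespace on both sides;
    # \1 and \2 put the captured whitespace back around the entity.
    return re.sub(r"(\s)<(\s)", r"\1&lt;\2", data)
```

Note the limits of this approach: real tags such as <b> are left alone, but so is an unspaced comparison like x<y, which this pattern cannot tell apart from the start of a tag.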
Upvotes: 0
Reputation: 2456
Not sure what you mean exactly, but maybe the PHP function htmlentities could help you.
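One caveat with whole-string escaping: it treats every < the same, so real tags get escaped too. A small Python illustration, using the standard-library html.escape, which plays a similar role but is not an exact equivalent of htmlentities (the function name escape_everything is hypothetical):

```python
from html import escape


def escape_everything(markup):
    # Escapes &, < and > everywhere, including inside real tags
    return escape(markup, quote=False)
```

So this is only suitable when the input is meant to be shown as plain text, not when existing markup has to survive.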
Upvotes: 0