Reputation: 203
I have been trying to parse web pages with PHP's DOMDocument so that an application can scan them for SEO quality.
However, I have run into a problem. For testing purposes I wrote a small HTML page containing the following incorrect HTML:
<head>
<meta name="description" content="randomdesciption">
</head>
<title>sometitle</title>
As you can see, the title is outside the head element, which is the error I am trying to detect.
Now comes the problem: when I use cURL to fetch the page and pass the response string to DOMDocument to load it as HTML, it "fixes" the markup by adding another <head> and </head> pair around the title:
<head>
<meta name="description" content="randomdesciption">
</head>
<head><title>sometitle</title></head>
I have checked the cURL response data, and that is not the problem: PHP's DOMDocument repairs the HTML syntax while executing the loadHTML() method.
I have also tried setting the DOMDocument properties recover, substituteEntities and validateOnParse to false, without success.
I have been searching Google but have not found any answers so far. I guess it is rare for someone to actually want broken HTML left unfixed.
Does anyone know how to prevent DOMDocument from fixing my broken HTML?
Upvotes: 8
Views: 4287
Reputation: 316969
UPDATE: as of PHP 5.4 (with libxml 2.7.7 or later) you can pass the LIBXML_HTML_NOIMPLIED option, which sets libxml's HTML_PARSE_NOIMPLIED flag:
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED);
Original answer below
You can't. In theory, libxml has a flag, HTML_PARSE_NOIMPLIED, that prevents implied markup from being added, but it is not accessible from PHP.
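Because the parser repairs the markup before you can inspect it, one possible workaround is to check the raw response string rather than the DOM. The following is only a minimal sketch of that idea, not something DOMDocument offers: `firstTagOffset()` and `titleIsInsideHead()` are hypothetical helper names, and the check simply compares byte offsets of the first matching tags, so it assumes there are no HTML comments or scripts containing look-alike tags.

```php
<?php
// Hypothetical helper: byte offset of the first match of $pattern, or null.
function firstTagOffset(string $html, string $pattern): ?int
{
    return preg_match($pattern, $html, $m, PREG_OFFSET_CAPTURE) === 1
        ? $m[0][1]
        : null;
}

// Hypothetical helper: true when the first <title> opens between the
// first <head> and the first </head> in the raw, unparsed markup.
function titleIsInsideHead(string $html): bool
{
    $headOpen  = firstTagOffset($html, '/<head[\s>]/i');
    $headClose = firstTagOffset($html, '/<\/head\s*>/i');
    $title     = firstTagOffset($html, '/<title[\s>]/i');

    if ($headOpen === null || $headClose === null || $title === null) {
        return false; // at least one of the tags is missing entirely
    }

    return $headOpen < $title && $title < $headClose;
}

// The broken test page from the question.
$broken = <<<HTML
<head>
<meta name="description" content="randomdesciption">
</head>
<title>sometitle</title>
HTML;

var_dump(titleIsInsideHead($broken)); // bool(false)
```

A check like this runs on the cURL response before loadHTML() is called, so the parser's repairs cannot hide the error; the DOM can still be used afterwards for everything that does not depend on the original tag placement.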
On a side note, this particular behavior seems to depend on the LIBXML_VERSION used.
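Since the result depends on the libxml build, it can help to log the version alongside the output. LIBXML_VERSION is an integer such as 20707; the sketch below decodes it into a dotted string and checks for the 2.7.7 threshold that the PHP manual gives for LIBXML_HTML_NOIMPLIED. The function name `libxmlVersionToString()` is a hypothetical helper, and `intdiv()` assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: turn libxml's integer version (e.g. 20707)
// into a dotted string (e.g. "2.7.7").
function libxmlVersionToString(int $version): string
{
    $major = intdiv($version, 10000);
    $minor = intdiv($version, 100) % 100;
    $patch = $version % 100;

    return "$major.$minor.$patch";
}

echo libxmlVersionToString(20707), PHP_EOL; // prints 2.7.7

// The flag needs both a PHP build that defines the constant
// and libxml 2.7.7 or later.
$canSkipImplied = defined('LIBXML_HTML_NOIMPLIED') && LIBXML_VERSION >= 20707;
var_dump($canSkipImplied);
```

PHP also exposes the same information as the LIBXML_DOTTED_VERSION constant, which may be simpler if only a display string is needed.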
Running this snippet:
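<?php
// The broken test page: <title> sits outside <head>.
$html = <<<HTML
<head>
<meta name="description" content="randomdesciption">
</head>
<title>sometitle</title>
HTML;

$dom = new DOMDocument;
$dom->loadHTML($html);
$dom->formatOutput = true;
// Print the parsed markup, followed by the libxml version in use.
echo $dom->saveHTML(), LIBXML_VERSION;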
on my machine gives the following (the trailing 20707 is LIBXML_VERSION, i.e. libxml 2.7.7):
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head><meta name="description" content="randomdesciption"></head>
<title>sometitle</title>
</html>
20707
Upvotes: 9