Björn
Björn

Reputation: 203

How to prevent the PHP DOMDocument from "fixing" your HTML string

I have been trying to parse webpages by use of the HTML DOMObject in order to use them for an application to scan them for SEO quality.

However I have run into a bit of a problem. For testing purposes I've written a small HTML page containing the following incorrect HTML:

<head>
<meta name="description" content="randomdesciption">
</head>
<title>sometitle</title>

As you can see the title is outside the head tag which is the error I am trying to detect.

Now comes the problem, when I use cURL to catch the response string from this page then send it to the DOM document to load it as HTML it actually fixes this by ADDING another <head> and </head> tags around the title.

<head>
<meta name="description" content="randomdesciption">
</head>
<head><title>sometitle</title></head>

I have checked the cURL response data and that in fact is not the problem, somehow the PHP DOMDocument during the execution of the loadHTML() method fixes the html syntax.

I have also tried turning off the DOMDocument recover, substituteEntities and validateOnParse attributes by setting them to false, without success.

I have been searching google but I am unable to find any answers so far. I guess it is a bit rare for some one that actually want the broken HTML not being fixed.

Anyone know how to prevent the DOMDocument from fixing my broken HTML?

Upvotes: 8

Views: 4287

Answers (1)

Gordon
Gordon

Reputation: 316969

UPDATE: as of PHP 5.4 you can use HTML_PARSE_NO_IMPLIED

$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED);

Original answer below

You cant. In theory there is a flag HTML_PARSE_NO_IMPLIED for that in libxml to prevent adding implied markup, but its not accessible from PHP.

On a sidenote, this particular behavior seems to depend on the LIBXML_VERSION used.

Running this snippet:

<?php
$html = <<< HTML
<head>
<meta name="description" content="randomdesciption">
</head>
<title>sometitle</title>
HTML;

$dom = new DOMDocument;
$dom->loadHTML($html);
$dom->formatOutput = true;
echo $dom->saveHTML(), LIBXML_VERSION;

on my machine will give

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head><meta name="description" content="randomdesciption"></head>
<title>sometitle</title>
</html>
20707

Upvotes: 9

Related Questions