chris01

Reputation: 12377

Perl - XML::LibXML: bad parsing performance on Apache2 default page

I was testing some code that involved parsing XML. For a quick test I requested / on my localhost, and the response was my Apache2 default page. So far, so good.

The response is XHTML and therefore XML, so I used it as input for my parsing (about 11 kB in size).

my $doc = XML::LibXML->load_xml(string => $response);

It takes about 16 seconds to finish, with no errors.

If I give it another XML file of twice the size, it parses in essentially no time.
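For reference, this is roughly what I am doing (a minimal sketch; the LWP::UserAgent fetch and the timing code are only there to illustrate my setup):

use strict;
use warnings;
use LWP::UserAgent;
use Time::HiRes qw(time);
use XML::LibXML;

# Fetch the Apache2 default page; decoded_content() returns the body only,
# without the HTTP headers.
my $response = LWP::UserAgent->new->get('http://localhost/')->decoded_content;

my $start = time;
my $doc   = XML::LibXML->load_xml(string => $response);
printf "parsed in %.1f s\n", time - $start;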

So...why????

Apache/2.4.10
Debian/8.6
XML::LibXML/2.0128

EDIT

I should mention that I removed the HTTP headers, which are not XML.

So the string starts with

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" 

and ends with

</html>

EDIT

Link: http://s000.tinyupload.com/index.php?file_id=88759644475809123183

Upvotes: 0

Views: 193

Answers (1)

Grant McLean

Reputation: 7008

One possibility is that each time you parse the document, the parser is downloading the DTD from the W3C. You could confirm this using strace or a similar tool, depending on your platform.
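For example, on Linux (a sketch; yourscript.pl is a placeholder for whatever runs the parse), trace the network-related system calls and look for connections to w3.org:

strace -f -e trace=network -o trace.log perl yourscript.pl
grep connect trace.log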

The DTD contains, among other things, the named entity definitions, which map, for example, the string &nbsp; to the character U+00A0. So the parser does need the DTD in order to parse HTML documents; fetching it over HTTP on every parse, however, is obviously not a good idea.
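For instance, the Latin-1 entities file pulled in by the XHTML DTD declares &nbsp; roughly like this (160 decimal is U+00A0):

<!ENTITY nbsp "&#160;">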

One approach is to install a copy of the DTD locally and use that. On Debian/Ubuntu systems you can simply install the w3c-dtd-xhtml package, which also sets up the appropriate XML catalog entries so that libxml can find it.
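For example, on a Debian-based system:

sudo apt-get install w3c-dtd-xhtml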

Another approach is to use XML::LibXML->load_html instead of XML::LibXML->load_xml. In HTML parsing mode the parser is more forgiving of markup errors, and I believe it also always uses a local copy of the DTD.
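A minimal sketch of that approach (the inline document is just a stand-in for your real response body; recover and suppress_errors are optional parser flags that quieten complaints about imperfect markup):

use strict;
use warnings;
use XML::LibXML;

# Stand-in for the real response body, with the HTTP headers already stripped.
my $response = <<'XHTML';
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Apache2 Debian Default Page</title></head>
  <body><p>It works!&nbsp;</p></body>
</html>
XHTML

# Parse in HTML mode instead of XML mode.
my $dom = XML::LibXML->load_html(
    string          => $response,
    recover         => 1,
    suppress_errors => 1,
);

print $dom->findvalue('//title'), "\n";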

The parser also provides options that allow you to specify your own handler routine for retrieving referenced URIs.
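One mechanism for this is XML::LibXML::InputCallback, which lets you intercept the remote DTD URLs and serve local files instead. A sketch, assuming a local copy of the DTD and its .ent files (both the directory and page.xhtml below are placeholders for your own locations):

use strict;
use warnings;
use XML::LibXML;

# Placeholder path: wherever your local copy of the XHTML DTD lives,
# e.g. as installed by the w3c-dtd-xhtml package.
my $local_dir = '/usr/share/xml/xhtml/schema/dtd/1.0';

my $callbacks = XML::LibXML::InputCallback->new();
$callbacks->register_callbacks([
    # match: claim anything under the W3C XHTML DTD directory
    sub { $_[0] =~ m{^http://www\.w3\.org/TR/xhtml1/DTD/} ? 1 : 0 },
    # open: map the remote URI to the local file of the same name
    sub {
        (my $file = $_[0]) =~ s{.*/}{};
        open my $fh, '<', "$local_dir/$file" or die "Cannot open $local_dir/$file: $!";
        return $fh;
    },
    # read: hand back up to the requested number of bytes
    sub { my ($fh, $len) = @_; my $buf; read($fh, $buf, $len); return $buf },
    # close: release the filehandle
    sub { close $_[0]; return 1 },
]);

my $parser = XML::LibXML->new();
$parser->input_callbacks($callbacks);
my $doc = $parser->parse_file('page.xhtml');   # placeholder file name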

Upvotes: 1
