Reputation: 2142
I have an XML file that contains a very large text node (>10 MB). Is it possible to skip (ignore) this node while reading the file?
I tried the following:
reader = XML::Reader.io(path)
while reader.read do
next if reader.name.eql?('huge-node')
end
But this still fails with the error: parser error : xmlSAX2Characters: huge text node
The only other solution I can think of is to first read the file as a string and remove the huge node through a gsub, and then parse the file. However, this method seems very inefficient.
Upvotes: 0
Views: 461
Reputation: 513
You don't have to skip the node. The error occurs because, since version 2.7.3, libxml2 limits a single text node to 10 MB. The parser option XML_PARSE_HUGE, introduced in that version, removes this limit.
Below is an example in PHP, where the option is exposed as LIBXML_PARSEHUGE:
# Read the entire document into a string
$result = file_get_contents("https://www.ncbi.nlm.nih.gov/gene/68943?report=xml&format=text");
# Parse the XML string into a SimpleXMLElement, with the size limits lifted
$xml = simplexml_load_string($result, 'SimpleXMLElement', LIBXML_COMPACT | LIBXML_PARSEHUGE);
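Since the question uses Ruby, here is a rough equivalent for the libxml-ruby gem. Treat it as a sketch under assumptions: I believe the binding exposes the same flag as LibXML::XML::Parser::Options::HUGE (when built against libxml2 2.7.3 or later) and that XML::Reader.file accepts an options argument, but check this against the version you have installed. The helper names open_huge_reader and each_node_name are made up for illustration.

```ruby
# Hypothetical helper: open a streaming reader with libxml2's hard-coded
# size limits relaxed (assumes the libxml-ruby gem is installed).
def open_huge_reader(path)
  require 'libxml' # loaded here so the gem is only needed when the helper is called
  LibXML::XML::Reader.file(path, options: LibXML::XML::Parser::Options::HUGE)
end

# Hypothetical helper: walk the document node by node and yield each node name.
def each_node_name(path)
  reader = open_huge_reader(path)
  while reader.read
    yield reader.name
  end
ensure
  reader&.close
end
```

With the limit lifted, the loop from the question should be able to walk past the huge node instead of raising, although the parser still has to read the node's text.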
Upvotes: 0
Reputation: 398
That's probably because, by the time you try to skip the node, the reader has already read it. According to the documentation for the #read method:
reader.read -> nil|true|false
Causes the reader to move to the next node in the stream, exposing its properties.
Returns true if a node was successfully read or false if there are no more nodes to read. On errors, an exception is raised.
You would need to skip the node before calling the #read method on it. I'm sure there are many ways to do that, but this library doesn't appear to support XPath expressions; otherwise I would suggest something along those lines.
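One such way is to filter the raw stream before the parser ever sees it, which avoids loading the whole file into a string for a gsub. The following is only a minimal sketch using the Ruby standard library. The method name strip_element is made up, and it assumes an ASCII tag name, an element that is never nested inside itself, never self-closing, never mentioned inside comments or CDATA, and a closing tag written exactly as </tag>.

```ruby
require 'stringio'

# Hypothetical helper: copy +input+ to +output+ in chunks, dropping every
# <tag ...>...</tag> element. Only a short tail is buffered at any time.
def strip_element(input, output, tag, chunk_size: 64 * 1024)
  open_re    = /<#{Regexp.escape(tag)}[\s>]/
  close_tag  = "</#{tag}>"
  open_keep  = tag.length + 1        # longest possible partial opening tag
  close_keep = close_tag.length - 1  # longest possible partial closing tag
  buffer     = String.new            # binary buffer, matches what IO#read(n) returns
  skipping   = false

  while (chunk = input.read(chunk_size))
    buffer << chunk
    loop do
      if skipping
        idx = buffer.index(close_tag)
        if idx
          # End of the unwanted element: resume copying after it.
          buffer = buffer[(idx + close_tag.length)..]
          skipping = false
        else
          # Discard skipped data, keeping a tail in case the closing tag is split.
          buffer = buffer[-close_keep..] || buffer
          break
        end
      else
        idx = buffer.index(open_re)
        if idx
          # Emit everything before the unwanted element, then start skipping.
          output << buffer[0...idx]
          buffer = buffer[idx..]
          skipping = true
        else
          # Emit all but a tail that might be the start of an opening tag.
          safe = buffer.length - open_keep
          if safe > 0
            output << buffer[0...safe]
            buffer = buffer[safe..]
          end
          break
        end
      end
    end
  end

  output << buffer unless skipping
  output
end
```

You could write the filtered output to a Tempfile (or any other IO object) and hand that to XML::Reader.io, so the reader never encounters the oversized text node.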
EDIT: The question was clarified so that the SAX parser is a required part of the solution. I have removed links that would not be helpful given this constraint.
Upvotes: 1