Reputation: 2221

Avoiding a space leak reading an HTML document with HXT

Link to truncated version of example document

I'm trying to extract the large chunk of text in the last "pre", process it, and output it. For the purposes of argument, let's say I want to apply

concatMap (unwords . take 62 . drop 11) . lines

to the text and output it.

When I do this, it takes over 400M of memory on a 4M HTML document.

~~The code I have is pretty simple, so I'm not including it for fear of biasing responses.~~
Here is one iteration of the code:

file = readDocument [(a_validate, v_0), (a_parse_html, v_1)] "Cache entry information.xhtml"
text = fmap last $ runX $
  file >>>
  deep (hasName "pre") />
  isText >>>
--  changeText (unwords . take 62 . drop 11 . lines) >>>
  getText

I think the problem is that the way I'm doing it, HXT is trying to keep all the text in memory as it reads it.

According to this it appears that HXT needs to at least read the whole document, although not to store it in memory.

I'm going to try other parsers, HaXml being the next one.
N.B. I have solved the initial problem by treating the input file as plain text and the desired portion as delimited by "<pre>00000000:" and "</pre></body>\n</html>"
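
For illustration, the plain-text workaround described above might look roughly like the sketch below. It is not the asker's actual code: the helper names (dropUntil, takeUntil, extractDump) are made up for this example, and it assumes the opening marker appears once before the dump and that the "00000000:" offset belongs to the wanted text. It uses only base and reads from standard input lazily, so it should not need to hold the whole document in memory.

```haskell
import Data.List (isPrefixOf)

-- | Drop characters until the remaining input starts with the marker.
-- Returns the empty string if the marker never occurs.
dropUntil :: String -> String -> String
dropUntil marker = go
  where
    go [] = []
    go s@(_ : rest)
      | marker `isPrefixOf` s = s
      | otherwise             = go rest

-- | Keep characters until the remaining input starts with the marker.
-- Keeps everything if the marker never occurs.
takeUntil :: String -> String -> String
takeUntil marker = go
  where
    go [] = []
    go s@(c : rest)
      | marker `isPrefixOf` s = []
      | otherwise             = c : go rest

-- | Text of the last "pre": starts at "00000000:" and stops before the
-- closing tags. The "<pre>" tag itself is dropped.
extractDump :: String -> String
extractDump = takeUntil closeMarker . drop (length "<pre>") . dropUntil openMarker
  where
    openMarker  = "<pre>00000000:"
    closeMarker = "</pre></body>\n</html>"

-- Reads the document from stdin and writes the extracted text to stdout.
main :: IO ()
main = interact extractDump
```

Any further processing (such as the take/drop pipeline from the question) can be composed after extractDump inside the same interact call.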

Upvotes: 1

Answers (2)

fuz

Reputation: 92984

Try using a lazy ByteString from the module Data.ByteString.Lazy. The default String type is a linked list of characters, which suits recursive processing but behaves quite badly with large amounts of data. You can also try making your functions stricter (e.g. using seq) to avoid the overhead of unevaluated thunks. But be careful, as this may make things even worse if applied incorrectly.
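
As a rough sketch of the lazy ByteString suggestion (assuming the input is handled as plain bytes rather than through HXT, and using the Char8 interface from the bytestring package that ships with GHC), the line processing could be written like this. The name process is just for this example.

```haskell
import qualified Data.ByteString.Lazy.Char8 as BL

-- | Skip the first 11 lines, keep the next 62, and join them with spaces.
process :: BL.ByteString -> BL.ByteString
process = BL.unwords . take 62 . drop 11 . BL.lines

-- Reads stdin lazily in chunks and writes the result to stdout.
main :: IO ()
main = BL.interact process
```

This only changes the string representation; it does not change how much of the document an XML parser decides to keep in memory.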

PS: It's always a good idea to supply a brief example.

Upvotes: 0

Stephen Tetley

Reputation: 106

Is HXT's parser an "online" parser?

The example you have works fine for String, provided each line isn't pathologically long:

unwords . take 62 . drop 11 . lines

Here you will only consume 73 lines of input, 11 that you drop and 62 that you operate on. However the example is mostly irrelevant to XML processing. If HXT's parser is not an online parser you will have to read the whole file into memory before you can operate on any embedded string data.
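
To make the point about consuming only 73 lines concrete, here is a small sketch that runs the same pipeline over an endless input. The names firstChunk and endlessInput are invented for this example; the program finishes only because lines, drop and take are lazy and never look past line 73.

```haskell
-- | Skip the first 11 lines, keep the next 62, and join them with spaces.
firstChunk :: String -> String
firstChunk = unwords . take 62 . drop 11 . lines

-- | Made-up test data: an infinite stream "line 1\nline 2\n...".
endlessInput :: String
endlessInput = unlines ["line " ++ show n | n <- [1 :: Int ..]]

-- Prints "line 12" through "line 73" on one row, then exits.
main :: IO ()
main = putStrLn (firstChunk endlessInput)
```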

I'm afraid I don't know whether or not HXT is an online parser, but that would seem to be the crux of your problem.

Upvotes: 1
