riri

Reputation: 509

Memory leak in a Haskell program

I've been trying to read and index a set of links and Wikipedia abstracts with a Haskell program (a rough sketch of the intended pipeline follows the list below). The aim is to:

  1. Read one line from the file
  2. Parse it
  3. Hash and index some parts of the line, along with the offset into the file so I can find it later
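To make the aim concrete, here's a minimal sketch of that pipeline, assuming the index maps the hash of a line to its byte offset, built with Data.HashMap.Strict and Data.Hashable. The names are illustrative (it hashes the whole line for simplicity) and this isn't the code from the gist:

    {-# LANGUAGE BangPatterns #-}
    import qualified Data.ByteString.Char8 as B
    import qualified Data.HashMap.Strict   as HM
    import           Data.Hashable         (hash)
    import           System.IO             (IOMode (ReadMode), hIsEOF, withFile)

    -- Rough sketch of the intended pipeline: read one line at a time and map
    -- the line's hash to the byte offset where the line starts.
    buildIndex :: FilePath -> IO (HM.HashMap Int Int)
    buildIndex path = withFile path ReadMode (go 0 HM.empty)
      where
        go !offset !idx h = do
            eof <- hIsEOF h
            if eof
                then return idx
                else do
                    line <- B.hGetLine h
                    let !offset' = offset + B.length line + 1  -- +1 for the stripped '\n'
                        !idx'    = HM.insert (hash line) offset idx
                    go offset' idx' h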

The problem is that my program is very slow and uses a large amount of memory. I've done some profiling and I know that about half the time (~50%) is spent in garbage collection. I've tried everything I could think of to get to what I have now, including:

  1. ST monad stuff with Data.Map as the storage
  2. Using hashed values as keys, rather than keeping around entire bytestrings
  3. Using seq and deepseq to try to prevent lazy allocation
  4. Parsing line by line instead of hoping that laziness would make a list of parsed lines space- and time-efficient
  5. Using immutable hashmaps rather than Data.Map
  6. Using mutable hashmaps (a sketch combining attempts 2, 3 and 6 follows this list)
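For reference, here is roughly what attempts 2, 3 and 6 look like combined, as a sketch with hypothetical names (using the hashtables package's Data.HashTable.IO), not the actual code from the gist:

    {-# LANGUAGE BangPatterns #-}
    import qualified Data.ByteString.Char8 as B
    import           Data.Hashable         (hash)
    import qualified Data.HashTable.IO     as H

    type Index = H.BasicHashTable Int Int

    -- Store only the Int hash of the line (attempt 2), force it with a bang
    -- pattern before the insert (attempt 3), and use a mutable hash table
    -- (attempt 6) so the map isn't rebuilt on every update.
    insertLine :: Index -> Int -> B.ByteString -> IO ()
    insertLine table offset line = do
        let !key = hash line  -- evaluated now, so the ByteString can be collected
        H.insert table key offset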

Code, data, profiling output and compilation stuff: https://gist.github.com/ririw/8205284

Thanks!

Upvotes: 0

Views: 232

Answers (1)

riri

Reputation: 509

OK, I've managed to reduce the memory use by 100 MB or so on my test set (the first 400K lines of the DBpedia links dataset, available at http://downloads.dbpedia.org/3.9/en/wikipedia_links_en.nt.bz2).

I did this by switching to the io-streams library, which seems to have solved some of the issues with laziness.
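In case it helps anyone else, the shape of the io-streams version is roughly this (a simplified sketch, not the exact code from the gist):

    {-# LANGUAGE BangPatterns #-}
    import qualified Data.ByteString.Char8 as B
    import qualified Data.HashMap.Strict   as HM
    import           Data.Hashable         (hash)
    import qualified System.IO.Streams     as Streams

    -- Simplified sketch of the io-streams version: the file is exposed as an
    -- InputStream of lines, and each line is consumed and indexed immediately,
    -- so nothing lazy accumulates between reads.
    indexFile :: FilePath -> IO (HM.HashMap Int Int)
    indexFile path =
        Streams.withFileAsInput path $ \bytes -> do
            lineStream <- Streams.lines bytes
            let go !offset !idx = do
                    next <- Streams.read lineStream
                    case next of
                        Nothing   -> return idx
                        Just line -> go (offset + B.length line + 1)
                                        (HM.insert (hash line) offset idx)
            go 0 HM.empty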

You can see the new solution at https://gist.github.com/ririw/8207250

On a 7,574,825-line dataset it does OK, and the memory usage ticks up the way I'd expect a hash table's to as it fills up and resizes (https://i.sstatic.net/LFwpV.jpg). Of course, that might also be the io-streams buffers doing the same thing.

It still uses a lot of memory :(, but I think it has solved the laziness/IO issue.

Upvotes: 1
