Andrew Gibiansky

Reputation: 233

Haskell Hashtable Performance

I am trying to use hash tables in Haskell with the hashtables package, but I cannot get anywhere near Python's performance. How can I achieve similar performance? Is it possible with current Haskell libraries and compilers? If not, what is the underlying issue?

Here is my Python code:

y = {}
for x in xrange(10000000):
    y[x] = x
print y[100]

Here's my corresponding Haskell code:

import qualified Data.HashTable.IO as H
import Control.Monad

main = do
  y <- H.new :: IO (H.CuckooHashTable Int Int)
  forM_ [1..10000000] $ \x -> H.insert y x x
  H.lookup y 100 >>= print

Here is another version using Data.Map, which is slower than both for me:

import qualified Data.Map as Map
import Data.List
import Control.Monad
main = do
  let m = foldl' (\m x -> Map.insert x x m) Map.empty [1..10000000]
  print $ Map.lookup 100 m

Interestingly enough, Data.HashMap performs very badly:

import qualified Data.HashMap.Strict as Map
import Data.List
main = do
  let m = foldl' (\m x -> Map.insert x x m) Map.empty [1..10000000]
  print $ Map.lookup 100 m

My suspicion is that Data.HashMap performs badly because, unlike Data.Map, it is not spine-strict (I think), so foldl' behaves like a plain foldl, with the associated thunk build-up problems.
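
One way to test that suspicion would be to force the finished map to normal form before the lookup, using deepseq (a sketch only, assuming the deepseq package and the NFData instance that unordered-containers provides for HashMap); if the timing barely changes, thunk build-up is probably not the real problem:

import qualified Data.HashMap.Strict as Map
import Data.List (foldl')
import Control.DeepSeq (deepseq)

main :: IO ()
main = do
  let m = foldl' (\acc x -> Map.insert x x acc) Map.empty [1 .. 10000000 :: Int]
  -- deepseq reduces the whole map (spine and values) to normal form, so any
  -- thunks left over from the fold are paid for here rather than at lookup time.
  m `deepseq` print (Map.lookup 100 m)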

Note that I have used -prof and verified that the majority of the time is spent in the hashtables or Data.Map code, not in the forM_ loop or anything like that. All code is compiled with -O2 and no other flags.
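
For reference, the build and profiling steps were along these lines (assuming a single source file named Main.hs):

$ ghc -O2 Main.hs
$ ghc -O2 -prof -fprof-auto -fforce-recomp Main.hs
$ ./Main +RTS -p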

Upvotes: 8

Views: 3578

Answers (3)

user4222027

Reputation:

As reddit.com/u/cheecheeo suggested here, using Data.Judy gets you similar performance for your particular microbenchmark:

module Main where

import qualified Data.Judy as J
import Control.Monad (forM_)

main = do
    h <- J.new :: IO (J.JudyL Int)
    forM_ [0..10000000] $ \i -> J.insert (fromIntegral i) i h
    v <- J.lookup 100 h
    putStrLn $ show v

Timing the above:

$ time ./Main
Just 100

real    0m0.958s
user    0m0.924s
sys     0m0.032s

Timing the OP's Python code:

$ time ./main.py
100

real    0m1.067s
user    0m0.886s
sys     0m0.180s

Upvotes: 11

jamshidh

Reputation: 12070

Given the times above, I thought I would throw in a Data.Map solution built with fromList, which seems to be comparable to using newSized.

import qualified Data.Map as M

main = do
       print $ M.lookup 100 $ M.fromList $ map (\x -> (x,x)) [1..10000000]
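
Since the input list here is already ascending with distinct keys, fromDistinctAscList (which builds the map in O(n) rather than O(n log n)) might shave off a bit more; a small, untested variant:

import qualified Data.Map as M

main = do
       print $ M.lookup 100 $ M.fromDistinctAscList $ map (\x -> (x,x)) [1..10000000]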

Upvotes: 2

Christian Conkle

Reputation: 5992

The documentation for hashtables notes that "Cuckoo hashing, like the basic hash table implementation using linear probing, can suffer from long delays when the table is resized." You use new, which creates a new table of the default size. From looking at the source, it appears that the default size is 2. Inserting 10000000 items likely entails numerous resizings.

Try using newSized.
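
For example, a minimal adaptation of the question's program (the initial capacity of 10000000 passed to newSized is just a guess at a sensible value for this benchmark):

import qualified Data.HashTable.IO as H
import Control.Monad

main = do
  -- Pre-size the table so the ten million inserts do not trigger repeated resizes.
  y <- H.newSized 10000000 :: IO (H.CuckooHashTable Int Int)
  forM_ [1..10000000] $ \x -> H.insert y x x
  H.lookup y 100 >>= print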

Upvotes: 9
