Reputation: 233
I am trying to use hash tables in Haskell with the hashtables
package, and finding that I cannot get anywhere near Python's performance. How can I achieve similar performance? Is it possible given current Haskell libraries and compilers? If not, what's the underlying issue?
Here is my Python code:
y = {}
for x in xrange(10000000):
y[x] = x
print y[100]
Here's my corresponding Haskell code:
import qualified Data.HashTable.IO as H
import Control.Monad
main = do
y <- H.new :: IO (H.CuckooHashTable Int Int)
forM_ [1..10000000] $ \x -> H.insert y x x
H.lookup y 100 >>= print
Here is another version using Data.Map
, which is slower than both for me:
import qualified Data.Map as Map
import Data.List
import Control.Monad
main = do
let m = foldl' (\m x -> Map.insert x x m) Map.empty [1..10000000]
print $ Map.lookup 100 m
Interestingly enough, Data.HashMap
performs very badly:
import qualified Data.HashMap.Strict as Map
import Data.List
main = do
let m = foldl' (\m x -> Map.insert x x m) Map.empty [1..10000000]
print $ Map.lookup 100 m
My suspicion is that Data.HashMap
performs badly because unlike Data.Map
, it is not spine-strict (I think), so foldl'
is just a foldl
, with the associated thunk buildup problems.
Note that I have used -prof
and verified that the majority of the time is spend in the hashtables
or Data.Map
code, not on the forM
or anything like that. All code is compiled with -O2
and no other parameters.
Upvotes: 8
Views: 3578
Reputation:
As reddit.com/u/cheecheeo suggested here, using Data.Judy, you'll get similar performance for your particular microbenchmark:
module Main where
import qualified Data.Judy as J
import Control.Monad (forM_)
main = do
h <- J.new :: IO (J.JudyL Int)
forM_ [0..10000000] $ \i -> J.insert (fromIntegral i) i h
v <- J.lookup 100 h
putStrLn $ show v
Timeing the above:
$ time ./Main
Just 100
real 0m0.958s
user 0m0.924s
sys 0m0.032s
Timing the python code of OP:
$ time ./main.py
100
real 0m1.067s
user 0m0.886s
sys 0m0.180s
Upvotes: 11
Reputation: 12070
Given the times above, I thought I would throw in the Data.Map
solution, which seems to be comparable to using newSized.
import qualified Data.Map as M
main = do
print $ M.lookup 100 $ M.fromList $ map (\x -> (x,x)) [1..10000000]
Upvotes: 2
Reputation: 5992
The documentation for hashtables
notes that "Cuckoo hashing, like the basic hash table implementation using linear probing, can suffer from long delays when the table is resized." You use new
, which creates a new table of the default size. From looking at the source, it appears that the default size is 2. Inserting 10000000 items likely entails numerous resizings.
Try using newSized
.
Upvotes: 9