Reputation: 22596
I know a few different a way to build a Map
in Haskell :
fromList
fromAscList
Map
is a Monoid
(or a Semigroup
) and concat singletons
.I understand that the amortized complexity of #1 is O(n*log(n)) where as #2 is O(n). I Guess #3 should be roughly equivalent to #1 and might be subject to fusion.
The amortized is important, because Haskell being lazy by default, even though the lookup from a Map is O(log(n)), it can be in practice interleaved with the construction of the Map itself which is O(n * log(n)), which can make in practice the lookup being O(n * log(n)) (especially if you are building the map each time you need it). This also might append if you use hardcoded Map
For example, am I right to think that lookup 'b' (fromList [('a', 1), ('b', 2)])
is actually equivalent to just to d lookup in the list without using a intermediate Map ?
So is there a difference between #1 and #3 , or sorting the list and the calling fromList
?
Also, If I need a map to be only computed once, do I need to make sure GHC doesn't inline it, so it is shared between functions ?
I realized that the question might be a bit blur and in fact correspond to different use cased I encountered recently.
The first one correspond to a "static join". I have an app which manages items and each item code can be split in style and variation (For example 'T-Shirt-Red' => ('T-Shirt', 'Red'). The split is based on rules (and regexp) and is quite slow. To not recompute the rules all the times, this is done once and stored in a db table. I have a few pure function which need to be able to split an item code so I pass them a function Text -> (Text, Text)
. The function is actually a lookup partially applied on a Map. The code is similar to this
getSplitter :: Handler (Text -> (Text, Text))
getSplitter = do
sku'style'vars <- runDB $ rawSQL "SELECT sku, style, var FROM style_cache " [] -- load the split table
let sku'map = fromList [ (sku, (style,var))
| (sku, style, var) <- sku'style'vars
]
return $ flip lookup sku'map
This one can be easily speed up by sorting the item by sku
and using fromDistinctAscList
(which is actually faster than fromAscList
). However I still have some issue about how to cache it between different request.
The second case is to do a manual join between two tables, I usually do something along
do
sku'infos <- selectList [] [] -- load item info
let skuInfo = fromList sku'infos
orderLines <- selectList [] [] -- load orders
-- find infos corresponding to order items
mapM (\o -> (o, lookup (orderSku o) skuInfo) orderLines
There again I can sort sku'infos in SQL and use fromDistinctAscList.
A third case is when fetching miscellaneous info related to an category item in different table.
For example I want to be able to compare the sales (sales table) and the purchase (purchases table) by category.
In pure SQL I would do something along
SELECT style, sum(sales.amount), sum(purchase.amount)
FROM style_cache
LEFT JOIN sales USING(sku)
LEFT JOIN purchases USING(sku)
GROUP by style
However, this is a simplified example and in practice, the aggregation is much more complicated and have to be done in Haskell as well as the join. To do so I'm loading each table separately (grouping what I can in sql) and return a Map Style SalesInfo
, Map Style PurchaseInfo
etc ... and merge them. The table are quite big and I realize I end up loading everything in memory whereas I could probably by "zipping" things manually be much more efficient but I'm not sure how.
Upvotes: 5
Views: 524
Reputation: 152707
I'm not sure I understood the entire motivation behind this question, but I'll make a few comments:
Map
is spine-strict -- which means the tree structure of a Map
and the keys themselves are forced (at least far enough to do all the requisite comparisons) on every Map
operation. So I would expect Data.Map.lookup k (fromList xs)
to take O(n*log(n)) comparisons (n the length of xs
) whereas I would expect Prelude.lookup k xs
to take O(n) comparisons (actually just equality checks, but usually that's pretty much the same complexity as a comparison).fromAscList . sort
is reliably faster than fromList
, this is a performance bug in Data.Map
and the library should just be changed to define fromList = fromAscList . sort
. I would be very surprised if this were the case. People have spent a fair bit of time optimizing containers
, so I wouldn't expect to see any fruit hanging as low as that.Upvotes: 2