mb14
mb14

Reputation: 22596

Effective way to build a (Lazy) Map in Haskell

I know a few different a way to build a Map in Haskell :

  1. build it from a list using fromList
  2. build it from a sorted list using fromAscList
  3. Use the fact that Map is a Monoid (or a Semigroup) and concat singletons.

I understand that the amortized complexity of #1 is O(n*log(n)) where as #2 is O(n). I Guess #3 should be roughly equivalent to #1 and might be subject to fusion.

The amortized is important, because Haskell being lazy by default, even though the lookup from a Map is O(log(n)), it can be in practice interleaved with the construction of the Map itself which is O(n * log(n)), which can make in practice the lookup being O(n * log(n)) (especially if you are building the map each time you need it). This also might append if you use hardcoded Map

For example, am I right to think that lookup 'b' (fromList [('a', 1), ('b', 2)]) is actually equivalent to just to d lookup in the list without using a intermediate Map ?

So is there a difference between #1 and #3 , or sorting the list and the calling fromList ?

Update

Also, If I need a map to be only computed once, do I need to make sure GHC doesn't inline it, so it is shared between functions ?

Use case

I realized that the question might be a bit blur and in fact correspond to different use cased I encountered recently.

The first one correspond to a "static join". I have an app which manages items and each item code can be split in style and variation (For example 'T-Shirt-Red' => ('T-Shirt', 'Red'). The split is based on rules (and regexp) and is quite slow. To not recompute the rules all the times, this is done once and stored in a db table. I have a few pure function which need to be able to split an item code so I pass them a function Text -> (Text, Text). The function is actually a lookup partially applied on a Map. The code is similar to this

getSplitter :: Handler (Text -> (Text, Text))
getSplitter = do
      sku'style'vars <- runDB $ rawSQL "SELECT sku, style, var FROM style_cache " [] -- load the split table
      let sku'map = fromList [ (sku, (style,var)) 
                             | (sku, style, var) <- sku'style'vars
                             ]
      return $ flip lookup sku'map

This one can be easily speed up by sorting the item by sku and using fromDistinctAscList (which is actually faster than fromAscList). However I still have some issue about how to cache it between different request.

The second case is to do a manual join between two tables, I usually do something along

do
   sku'infos <- selectList [] [] -- load item info
   let skuInfo = fromList sku'infos

   orderLines <- selectList [] [] -- load orders
   -- find infos corresponding to order items
   mapM (\o -> (o, lookup (orderSku o) skuInfo) orderLines

There again I can sort sku'infos in SQL and use fromDistinctAscList.

A third case is when fetching miscellaneous info related to an category item in different table.

For example I want to be able to compare the sales (sales table) and the purchase (purchases table) by category.

In pure SQL I would do something along

SELECT style, sum(sales.amount), sum(purchase.amount)
FROM style_cache
LEFT JOIN sales USING(sku)
LEFT JOIN purchases USING(sku)
GROUP by style

However, this is a simplified example and in practice, the aggregation is much more complicated and have to be done in Haskell as well as the join. To do so I'm loading each table separately (grouping what I can in sql) and return a Map Style SalesInfo, Map Style PurchaseInfo etc ... and merge them. The table are quite big and I realize I end up loading everything in memory whereas I could probably by "zipping" things manually be much more efficient but I'm not sure how.

Upvotes: 5

Views: 524

Answers (1)

Daniel Wagner
Daniel Wagner

Reputation: 152707

I'm not sure I understood the entire motivation behind this question, but I'll make a few comments:

  1. Map is spine-strict -- which means the tree structure of a Map and the keys themselves are forced (at least far enough to do all the requisite comparisons) on every Map operation. So I would expect Data.Map.lookup k (fromList xs) to take O(n*log(n)) comparisons (n the length of xs) whereas I would expect Prelude.lookup k xs to take O(n) comparisons (actually just equality checks, but usually that's pretty much the same complexity as a comparison).
  2. If fromAscList . sort is reliably faster than fromList, this is a performance bug in Data.Map and the library should just be changed to define fromList = fromAscList . sort. I would be very surprised if this were the case. People have spent a fair bit of time optimizing containers, so I wouldn't expect to see any fruit hanging as low as that.
  3. Yes, inlining breaks sharing.

Upvotes: 2

Related Questions