Reputation: 43330
Please consider the following data model:
data Artist = Artist Text
data Song = Song Artist Text
data Catalogue = Catalogue (Set Artist) (Set Song)
You can see that the Artist
s are referred to from both the Song
s and the Catalogue
. The Catalogue
contains a list of all artists referred to from Song
s, so the same values of Artist
get referred to from two places.
Suppose we were to generate the Catalogue
value using multiple applications of the following function:
insertSong :: Song -> Catalogue -> Catalogue
insertSong song@(Song artist title) (Catalogue artists songs) =
Catalogue (Set.insert artist artists) (Set.insert song songs)
It's evident that the Catalogue
would get filled by references to the same values of Artist
as the Song
s refer to, thus saving the memory by not storing the copies of those values.
The problem is that when I try to recreate the catalogue from serialized data by separately deserializing a set of artists and a set of songs, the application occupies way more memory than when it generated the same value of Catalogue
with insertSong
. I suspect that it is caused by the lost relation between same Artist
s referred to from Song
s and the Catalogue
, which is why I get copies of values of Artist
occupying extra memory.
The only solution I see is to first deserialize the set of artists and then to deserialize the set of songs while forcefully replacing the values of Artist
with the ones from the first set.
So my questions are:
Upvotes: 7
Views: 130
Reputation: 43330
I've accidentally stumbled upon a project, which approaches the issue. See RefSerialize.
Upvotes: 1
Reputation: 63399
A simple solution seems to cache data using a somewhat degenerated map:
{-# LANGUAGE DeriveDataTypeable, RankNTypes #-}
import Control.Monad
import Control.Monad.State
import Data.Map (Map)
import qualified Data.Map as M
type Cache a = Map a a
We can then query this cache if there is already an entry equal to this one, and replace it with the cached one:
cached :: (Ord a) => a -> State (Cache a) a
cached x = state $ \m ->
case M.lookup x m of
Just x' -> (x', m)
Nothing -> (x, M.insert x x m)
This way, if we load several equal elements of type a
, we transform them into a single. This can be done during deserialization or once at the end.
Perhaps it would be possible to generalize it further and use SYB to map all values of some given type in a data structure through a cache:
import Data.Data (Data)
import Data.Generics.Aliases (mkM)
import Data.Generics.Schemes (everywhereM)
import Data.Typeable (Typeable)
replaceFromCache
:: (Ord a, Typeable a, Data b)
=> b -> State (Cache a) b
replaceFromCache = everywhereM (mkM cached)
Then we can replace all artists in some data structure like
data Artist = Artist String
deriving (Eq, Ord, Typeable)
cacheAllArtists :: (Data b) => b -> b
cacheAllArtists b = evalState (replaceFromCache b) (M.empty :: Cache Artist)
Or we can use Proxy
phantom type to create a general version:
cacheAll :: (Ord a, Typeable a, Data b)
=> Proxy a -> b -> b
cacheAll p = flip evalState (emptyOf p) . replaceFromCache
where
emptyOf p = asTypeOf2 M.empty p
asTypeOf2 :: f a -> Proxy a -> f a
asTypeOf2 = const
cacheAllArtists :: (Data b) => b -> b
cacheAllArtists = cacheAll (Proxy :: Proxy Artist)
(Disclaimer: I haven't tested any of the above code.)
Upvotes: 3
Reputation: 38758
Note that sharing will be also lost if you do any kind of computation on your strings (i.e. even if artist1
and artist2
are the same and shared, f artist1
and f artist2
are probably not). If this becomes a problem, you can do similar changes to your data structures, too.
Upvotes: 6