Snowball
Snowball

Reputation: 11656

Advice defining a data structure in Haskell

I'm having trouble modeling a data structure in Haskell. Suppose I'm running an an animal research facility and I want to keep track of my rats. I want to track the assignment of the rats to cages and to experiments. I also want to keep track of the weight of my rats, the volume of my cages, and keep notes on my experiments.

In SQL, I might do:

create table cages (id integer primary key, volume double);
create table experiments (id integer primary key, notes text)
create table rats (
    weight double,
    cage_id integer references cages (id),
    experiment_id integer references experiments (id)
);

(I realize that this allows me to assign two rats from different experiments to the same cage. That is intended. I don't actually run an animal research facility.)

Two operations that must be possible: (1) given a rat, find the volume of its cage and (2) given a rat, get the notes for the experiment it belongs to.

In SQL, those would be

select cages.volume from rats
  inner join cages on cages.id = rats.cage_id
  where rats.id = ...; -- (1)
select experiments.notes from rats
  inner join experiments on experiments.id = rats.experiment_id
  where rats.id = ...; -- (2)

How might I model this data structure in Haskell?


One way to do it is

type Weight = Double
type Volume = Double

data Rat = Rat Cage Experiment Weight
data Cage = Cage Volume
data Experiment = Experiment String

data ResearchFacility = ResearchFacility [Rat]

ratCageVolume :: Rat -> Volume
ratCageVolume (Rat (Cage volume) _ _) = volume

ratExperimentNotes :: Rat -> String
ratExperimentNotes (Rat _ (Experiment notes) _) = notes

But wouldn't this structure introduce a bunch of copies of the Cages and Experiments? Or should I just not worry about it and hope the optimizer takes care of that?

Upvotes: 17

Views: 2088

Answers (4)

Luis Casillas
Luis Casillas

Reputation: 30227

First observation: you should learn to use records. Record field names in Haskell are treated as functions, so these definitions will at least have you typing less:

data Rat = Rat { getCage       :: Cage
               , getExperiment :: Experiment
               , getWeight     :: Weight }

data Cage = Cage { getVolume :: Volume }

-- Now this function is so trivial to define that you might actually not bother:
ratCageVolume :: Rat -> Volume
ratCageVolume = getVolume . getCage

And as to data representation, I might go somewhere along these lines:

type Weight = Double
type Volume = Double

-- Rats and Cages have identity that goes beyond their properties;
-- two distinct rats of the same weight can be in the same cage, and
-- two cages can have same volume.
-- 
-- So should we give each Rat and Cage an additional field to
-- represent its key?  We could do that, or we could abstract that out
-- into this:

data Identity i a = Identity { getId  :: i
                             , getVal :: a }
            deriving Show

instance Eq i => Eq (Identity i a) where
    a == b = getId a == getId b

instance Ord i => Ord (Identity i a) where
    a `compare` b = getId a `compare` getId b


-- And to simplify a common case:
type Id a = Identity Int a


-- Rats' only real intrinsic property is their weight.  Cage and Experiment?
-- Situational, I say.
data Rat = Rat { getWeight :: Weight  }

data Cage = Cage { getVolume :: Volume }

data Experiment = Experiment { getNotes :: String }
                  deriving (Eq, Show)

-- The data that you're manipulating is really this:
type RatData = (Id Rat, Id Cage, Id Experiment)

type ResearchFacility = [RatData]

Upvotes: 4

mightybyte
mightybyte

Reputation: 7272

I use Haskell most of the time in my day job and I have encountered this issue. My experience is that it's not so much an issue of how many copies of the data structures are created, and more about the data dependencies involved. We were using similar data structures to help interface with the relational database where the actual data is stored. That means that we had queries like this.

getCageById       :: IdType -> IO (Maybe Cage)
getRatById        :: IdType -> IO (Maybe Rat)
getExperimentById :: IdType -> IO (Maybe Experiment)

We started out with our data structures built like yours with the linked data structures contained within them. This turned out to be a huge mistake. The problem is that if you use the following definition for Rat...

data Rat = Rat Cage Experiment Weight

...then the getRatById function has to run three database queries in order to return a result. This seemed like a nice convenient way of doing things at first, but it ended up being a huge performance problem, especially if we wanted a query to return a bunch of results. The data structure FORCES us to do the join even if we only want the row from the rat table. The extra database queries are the problem, not the potential for extra objects in RAM.

Now our policy is that when we're making data structures that are meant to correspond to database tables, we always denormalize them just like the tables. So your example would become something like this:

type IdType = Int
type Weight = Double
type Volume = Double

data Rat = Rat
    { ratId        :: IdType
    , cageId       :: IdType
    , experimentId :: IdType
    , weight       :: Weight
    }
data Cage = Cage IdType Volume
data Experiment = Experiment IdType String

(You might even want to use newtypes to distinguish between different IDs.) It's more work to get the whole structure, but it allows you to get at parts of the structure efficiently. Of course, if you never need to get individual parts of the structure, then my advice might not be appropriate. But my experience is that partial queries are very common and I don't want to make them artificially slow. If you want the convenience of a function that does the join for you, then you can certainly write one. But don't use a data model that locks you into this pattern of use.

Upvotes: 3

Daniel Wagner
Daniel Wagner

Reputation: 152682

Here's a short file I used for testing:

type Weight = Double
type Volume = Double

data Rat = Rat Cage Experiment Weight deriving (Eq, Ord, Show, Read)
data Cage = Cage Volume               deriving (Eq, Ord, Show, Read)
data Experiment = Experiment String   deriving (Eq, Ord, Show, Read)

volume     = 30
name       = "foo"
weight     = 15
cage       = Cage volume
experiment = Experiment name
rat        = Rat cage experiment weight

Then I started ghci and imported System.Vacuum.Cairo, available from the delightful vacuum-cairo package.

*Main System.Vacuum.Cairo> view (rat, Rat (Cage 30) (Experiment "foo") 15)

not-shared

*Main System.Vacuum.Cairo> view (rat, Rat (Cage 30) experiment 15)

shared-experiment

(I'm not really sure why there's doubled-up arrows in this one, but you can ignore/collapse them.)

*Main System.Vacuum.Cairo> view (rat, Rat cage experiment weight)

shared-args

*Main System.Vacuum.Cairo> view (rat, rat)

shared-all

*Main System.Vacuum.Cairo> view (rat, Rat cage experiment (weight+1))

shared-modified

The rule of thumb, as should be illustrated above, is that new objects are created exactly when you call a constructor; otherwise, if you just name an already-created object, no new object is created. This is a safe thing to do in Haskell because it is an immutable language.

Upvotes: 16

Daniel
Daniel

Reputation: 27549

A more natural Haskell representation of your model would be for the cages to contain the actual rat objects instead of their ids:

data Rat = Rat RatId Weight
data Cage = Cage [Rat] Volume
data Experiment = Experiment [Rat] String

Then you would create ResearchFacility objects using a smart constructor to make sure they follow the rules. It can look something like:

research_facility :: [Rat] -> Map Rat Cage -> Map Rat Experiment -> ResearchFacility
research_facility rats cage_assign experiment_assign = ...

where the cage_assign and experiment_assign are maps which contain the same information as the cage_id and experiment_id foreign keys in sql.

Upvotes: 6

Related Questions