What Erlang data structure to use for ordered set with the possibility to do lookups?

Question

I am working on a problem where I need to remember the order of events I receive but also I need to lookup the event based on it's id. How can I do this efficiently in Erlang if possible without a third party library? Note that I have many potentially ephemeral actors with each their own events (already considered mnesia but it requires atoms for the tables and the tables would stick around if my actor died).

-record(event, {id, timestamp, type, data}).

zxq9 · Accepted Answer

Based on the details included in the discussion in comments on Michael's answer, a very simple, workable approach would be to create a tuple in your process state variable that stores the order of events separately from the K-V store of events.

Consider:

%%% Some type definitions so we know exactly what we're dealing with.
-type id()     :: term().
-type type()   :: atom().
-type data()   :: term().
-type ts()     :: calendar:datetime().
-type event()  :: {id(), ts(), type(), data()}.
-type events() :: dict:dict(id(), {type(), data(), ts()}).

% State record for the process.
% Should include whatever else the process deals with.
-record(s,
        {log    :: [id()],
         events :: event_store()}).

%%% Interface functions we will expose over this module.
-spec lookup(pid(), id()) -> {ok, event()} | error.
lookup(Pid, ID) ->
    gen_server:call(Pid, {lookup, ID}).

-spec latest(pid()) -> {ok, event()} | error.
latest(Pid) ->
    gen_server:call(Pid, get_latest).

-spec notify(pid(), event()) -> ok.
notify(Pid, Event) ->
    gen_server:cast(Pid, {new, Event}).

%%% gen_server handlers
handle_call({lookup, ID}, State#s{events = Events}) ->
    Result = find(ID, Events),
    {reply, Result, State};
handle_call(get_latest, State#s{log = [Last | _], events = Events}) ->
    Result = find(Last, Events),
    {reply, Result, State};
% ... and so on...

handle_cast({new, Event}, State) ->
    {ok, NewState} = catalog(Event, State),
    {noreply, NewState};
% ...

%%% Implementation functions
find(ID, Events) ->
    case dict:find(ID, Events) of
        {Type, Data, Timestamp} -> {ok, {ID, Timestamp, Type, Data}};
        Error                   -> Error
    end.

catalog({ID, Timestamp, Type, Data},
        State#s{log = Log, events = Events}) ->
    NewEvents = dict:store(ID, {Type, Data, Timestamp}, Events),
    NewLog = [ID | Log],
    {ok, State#s{log = NewLog, events = NewEvents}}.

This is a completely straightforward implementation and hides the details of the data structure behind the interface of the process. Why did I pick a dict? Just because (its easy). Without knowing your requirements better I really have no reason to pick a dict over a map over a gb_tree, etc. If you have relatively small data (hundreds or thousands of things to store) the performance isn't usually noticeably different among these structures.

The important thing is that you clearly identify what messages this process should respond to and then force yourself to stick to it elsewhere in your project code by creating an interface of exposed functions over this module. Behind that you can swap out the dict for something else. If you really only need the latest event ID and won't ever need to pull the Nth event from the sequence log then you could ditch the log and just keep the last event's ID in the record instead of a list.

So get something very simple like this working first, then determine if it actually suits your need. If it doesn't then tweak it. If this works for now, just run with it -- don't obsess over performance or storage (until you are really forced to).

If you find later on that you have a performance problem switch out the dict and list for something else -- maybe gb_tree or orddict or ETS or whatever. The point is to get something working right now so you have a base from which to evaluate the functionality and run benchmarks if necessary. (The vast majority of the time, though, I find that whatever I start out with as a specced prototype turns out to be very close to whatever the final solution will be.)

What Erlang data structure to use for ordered set with the possibility to do lookups?

Answers (2)

Related Questions