vincent-lg

Reputation: 559

How to handle saving large data in a distributed environment

In Elixir (and Erlang), we are encouraged to use processes to divide the work and have them communicate through small messages. Sometimes, though, I also need to handle not-so-small data that may be useful to more than one process, and I'm unsure how. Here's my use case:

I've designed a simple card game that allows multiple players to join the same game through their browser, and also to create new games. I keep each card game in a process, so I spawn a new process whenever a player asks to create a new game. I would also like my processes to save the card game to disk (or whatever storage is available).

My first reaction was to avoid doing this in the game process itself, so it wouldn't slow the game down too much: while the disk is being written to (and serialization is happening), messages sent by players to the game would be delayed. So I thought I'd create "save" processes with a simple job: receive the card game state for a given game and store it on disk. These would be servers reacting to casts, so the game process could just hand over the data whenever an action occurred (in effect: here's my card game, save it, I'm off).

But this raises another problem: the card game state has to be sent over the network (which might take a while if the save process lives on a different node). This could slow down process communication, and since distributed Erlang sends messages and node heartbeats over the same connection, large messages might even delay the heartbeat between nodes.
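Here is a minimal sketch of what I mean by a "save" process, as a GenServer reacting to casts. The module name (GameSaver) and the one-file-per-game layout are placeholders I made up for illustration:

    defmodule GameSaver do
      use GenServer

      def start_link(opts \\ []) do
        GenServer.start_link(__MODULE__, :ok, opts)
      end

      # Fire-and-forget: the game process hands its state over and moves on.
      def save(saver, game_id, game_state) do
        GenServer.cast(saver, {:save, game_id, game_state})
      end

      @impl true
      def init(:ok), do: {:ok, %{}}

      @impl true
      def handle_cast({:save, game_id, game_state}, cache) do
        # Serialization and disk I/O happen here, not in the game process.
        File.write!("games/#{game_id}.bin", :erlang.term_to_binary(game_state))
        {:noreply, Map.put(cache, game_id, game_state)}
      end
    end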

My games aren't that large: at current testing they weigh about 4 KB. And yet 4 KB might be a lot to send over a slow network (don't take network speed for granted). In my situation I probably don't need to worry (I could save the game directly in the game process and spare myself the trouble; it wouldn't slow the game down that much), but I'm interested in solutions and I'm coming up blank.

The advantage of "save" processes was that they could live on another node: if the game process crashed, one would be recreated dynamically and ask save processes if anyone had the copy of game ID 121. If the save process crashed, the game processes could send their updated copy to another process/node. It seemed like a good way to keep things in good state. Of course, having a game process and save process crash at the same time would ruin some data, but there's so much one can do in a distributed environment (or any environment, for that matter). Plus, in this scenario, communication between the node(s) hosting the games (it can be spread on several nodes) and the node(s) saving data wouldn't have to be particularly fast, since the only communication would be one-sided and incremental (unless an error occurred, as described).

This is more of a theoretical question. Elixir (or Erlang) isn't the only way to build a distributed system, though message sizes and heartbeats may matter differently elsewhere. Still, I would like to hear thoughts on ways to improve how my system handles saving data.

Thanks for your answers,

Upvotes: 1

Views: 67

Answers (1)

Srijan Choudhary

Reputation: 480

I think the main issue here is how to save large data without blocking and without causing a backlog.

Blocking can happen if the main process also does the saving; but if it hands the work off to a separate process, that can cause a backlog, and possible data loss in case of a crash.

The best way forward I can think of is to not save the whole state every time, but to save each mutation of the game state as an individual event, and have some logic that recreates the state from the individual events when restoring.
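A minimal sketch of that idea, assuming a per-game log file and an apply_event/2 function that the game code would define to apply one event to a state:

    defmodule EventLog do
      # One file per game; one serialized event per line.
      def append(game_id, event) do
        line = Base.encode64(:erlang.term_to_binary(event)) <> "\n"
        File.write!(path(game_id), line, [:append])
      end

      # Rebuild the state by folding every logged event over initial_state.
      def replay(game_id, initial_state, apply_event) do
        if File.exists?(path(game_id)) do
          path(game_id)
          |> File.stream!()
          |> Enum.reduce(initial_state, fn line, state ->
            event = line |> String.trim() |> Base.decode64!() |> :erlang.binary_to_term()
            apply_event.(state, event)
          end)
        else
          initial_state
        end
      end

      defp path(game_id), do: "games/#{game_id}.log"
    end

Appending one small event per action is much cheaper than serializing and shipping the whole game state each time.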

To optimize this further, the "save process" can also periodically dump the whole state, so that the number of events to roll up on recovery stays bounded.
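The snapshot could sit on top of the same log, for example like this (still with made-up names, and ignoring the crash window between the two writes, which a real implementation would close with atomic renames or numbered log segments):

    defmodule Snapshot do
      # Dump the full state, then start a fresh log, so recovery only
      # replays events newer than the snapshot.
      def take(game_id, state) do
        File.write!("games/#{game_id}.snapshot", :erlang.term_to_binary(state))
        File.write!("games/#{game_id}.log", "")
      end

      # Restore from the snapshot if there is one, then replay the tail.
      def restore(game_id, initial_state, apply_event) do
        base =
          case File.read("games/#{game_id}.snapshot") do
            {:ok, bin} -> :erlang.binary_to_term(bin)
            {:error, _reason} -> initial_state
          end

        EventLog.replay(game_id, base, apply_event)
      end
    end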

What I described here is a very basic version of how many databases work: they write transactions to an append-only log first (a write-ahead log) and roll it up in batches later.

Upvotes: 0
