Mikhas

Reputation: 882

Snapshot taking and restore strategies

I've been reading about CQRS+EventSourcing patterns (which I wish to apply in the near future), and one point common to all the decks and presentations I found is to take snapshots of your model state in order to restore it, but none of them shares patterns/strategies for doing that.

I wonder if you could share your thoughts and experience in this matter.

TL;DR: How have you implemented Snapshotting in your CQRS+EventSourcing application? Pros and Cons?

Upvotes: 24

Views: 8656

Answers (2)

VoiceOfUnreason

Reputation: 57287

  • Rule #1: Don't.
  • Rule #2: Don't.

Snapshotting an event sourced model is a performance optimization. The first rule of performance optimization? Don't.

Specifically, snapshotting reduces the amount of time you lose in your repository trying to reload the history of your model from your event store.

If your repository can keep the model in memory, then you aren't going to be reloading it very often. So the win from snapshotting will be small. Therefore: don't.

If you can decompose your model into aggregates, which is to say that you can decompose the history of your model into a number of entities that have non-overlapping histories, then your one long model history becomes many short histories that each describe the changes to a single entity. Each entity history that you need to load will be pretty short, so the win from a snapshot will be small. Therefore: don't.

The kind of systems I'm working on today require high performance but not 24x7 availability. So in a situation where I shut down my system for maintenance and restart it, I'd have to load and reprocess my entire event store, as my fresh system doesn't know which aggregate IDs to process the events for. I need a better starting point so that restarting my systems is more efficient.

You are worried about missing a write SLA when the repository memory caches are cold, and you have long model histories with lots of events to reload. Bolting on snapshotting might be a lot more reasonable than trying to refactor your model history into smaller streams. OK....

The snapshot store is a read model -- at any point in time, you should be able to blow away the model and rebuild it from the persisted history in the event store.

From the perspective of the repository, the snapshot store is a cache; if no snapshot is available, or if the store itself doesn't respond within the SLA, you want to fall back to reprocessing the entire event history, starting from the initial seed state.

The service provider interface is going to look something like

interface SnapshotClient {
    // Returns the information the repository needs to consume the snapshot for this id.
    SnapshotRecord getSnapshot(Identifier id);
}

SnapshotRecord is going to provide the repository with the information it needs to consume the snapshot. That's going to include, at a minimum:

  1. a memento that allows the repository to rehydrate the snapshotted state
  2. a description of the last event processed by the snapshot projector when building the snapshot.
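In code, such a record might look something like the sketch below; the field names are illustrative only, not prescribed by anything above:

// Illustrative sketch of a snapshot record; the exact shape is up to you.
class SnapshotRecord {
    final byte[] memento;         // opaque state the repository knows how to re-hydrate
    final long lastEventVersion;  // identifies the last event folded into this snapshot

    SnapshotRecord(byte[] memento, long lastEventVersion) {
        this.memento = memento;
        this.lastEventVersion = lastEventVersion;
    }
}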

The model will then re-hydrate the snapshotted state from the memento, load the history from the event store, scan backwards (i.e., starting from the most recent event) looking for the event documented in the SnapshotRecord, and then apply the subsequent events in order.
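Putting those pieces together, the repository's load path might look roughly like the sketch below. SnapshotClient and SnapshotRecord are as above, while EventStore, Model, Event, and their methods are assumed for illustration; for brevity it reads forward from the recorded version rather than scanning backwards:

// Sketch only: the snapshot store is treated as a cache, with full replay as the fallback.
class SnapshotAwareRepository {
    private final SnapshotClient snapshotClient;
    private final EventStore eventStore;

    SnapshotAwareRepository(SnapshotClient snapshotClient, EventStore eventStore) {
        this.snapshotClient = snapshotClient;
        this.eventStore = eventStore;
    }

    Model load(Identifier id) {
        SnapshotRecord snapshot = snapshotClient.getSnapshot(id);  // may be null: cache miss
        Model model;
        long fromVersion;
        if (snapshot != null) {
            model = Model.fromMemento(snapshot.memento);           // re-hydrate the snapshotted state
            fromVersion = snapshot.lastEventVersion + 1;           // resume just after the snapshot
        } else {
            model = Model.seed(id);                                // fall back to the initial seed state
            fromVersion = 0;                                       // and replay the whole history
        }
        for (Event e : eventStore.readStream(id, fromVersion)) {
            model.apply(e);                                        // apply the remaining events in order
        }
        return model;
    }
}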

The SnapshotRepository itself could be a key-value store (at most one record for any given id), but a relational database with blob support will work fine too:

select * 
from snapshots s 
where id = ? 
order by s.total_events desc 
limit 1

The snapshot projector and the repository are tightly coupled -- they need to agree on what the state of the entity should be for all possible histories, they need to agree how to de/re-hydrate the memento, and they need to agree which id will be used to locate the snapshot.

The tight coupling also means that you don't need to worry particularly about the schema for the memento; a byte array will be fine.

They don't, however, need to agree with previous incarnations of themselves. Snapshot Projector 2.0 discards/ignores any snapshots left behind by Snapshot Projector 1.0 -- the snapshot store is just a cache after all.
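One cheap way to get that behaviour is to stamp every record with the version of the projector that wrote it and treat anything else as a cache miss. A sketch, assuming the SnapshotRecord also carries a projectorVersion field and that the two helper methods exist:

// Sketch: snapshots written by an older projector version are just cache misses.
static final int PROJECTOR_VERSION = 2;

Model load(Identifier id) {
    SnapshotRecord snapshot = snapshotClient.getSnapshot(id);
    if (snapshot == null || snapshot.projectorVersion != PROJECTOR_VERSION) {
        return replayFromSeed(id);                // hypothetical helper: full replay, ignoring the snapshot
    }
    return replayFromSnapshot(snapshot, id);      // hypothetical helper: re-hydrate, then apply the tail
}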

I'm designing an application that will probably generate millions of events a day. What can we do if we need to rebuild a view six months later?

One of the more compelling answers here is to model time explicitly. Do you have one entity that lives for six months, or do you have 180+ entities that each live for one day? Accounting is a good domain to reference here: at the end of the fiscal year, the books are closed, and the next year's books are opened with the carryover.

Yves Reynhout frequently talks about modeling time and scheduling; "Evolving a Model" may be a good starting point.
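A sketch of the carryover idea (stream names, event types, and the eventStore.append call here are hypothetical): close one period's stream with its final balance and open the next period's stream with that balance, so rebuilding a view never has to replay more than one period's history.

// Hypothetical sketch: one bounded stream per fiscal year instead of one unbounded stream.
String streamFor(String accountId, int fiscalYear) {
    return "account-" + accountId + "-" + fiscalYear;              // e.g. "account-42-2024"
}

void closeBooks(String accountId, int fiscalYear, long closingBalance) {
    eventStore.append(streamFor(accountId, fiscalYear), new BooksClosed(closingBalance));
    eventStore.append(streamFor(accountId, fiscalYear + 1), new BooksOpened(closingBalance));
}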

Upvotes: 26

Charles

Reputation: 3774

There are few instances where you truly need to snapshot. But there are a couple - a common example is an account in a ledger. You'll have thousands, maybe millions, of credit/debit events producing the final BALANCE state of the account - it would be insane not to snapshot that every so often.
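To make that concrete (the event and field names below are made up for illustration, and java.util.List plus the event types are assumed), the snapshot of such an account is just the fold of the credit/debit events up to some version:

// Hypothetical sketch: fold credit/debit events into the balance that gets snapshotted.
long balanceFrom(long startingBalance, List<LedgerEvent> events) {
    long balance = startingBalance;
    for (LedgerEvent e : events) {
        if (e instanceof Credited c)       balance += c.amount();
        else if (e instanceof Debited d)   balance -= d.amount();
    }
    return balance;  // persist this, plus the version of the last event folded in, as the snapshot
}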

My approach to snapshotting when I designed Aggregates.NET was that it's off by default; to enable it, your aggregates or entities must inherit from AggregateWithMemento or EntityWithMemento, which in turn requires your entity to define a RestoreSnapshot, a TakeSnapshot, and a ShouldTakeSnapshot.

The decision whether to take a snapshot or not is left up to the entity itself. A common pattern is

Boolean ShouldTakeSnapshot() {
    return this.Version % 50 == 0;
}

Which of course would take a snapshot every 50 events.
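In generic terms - this is a sketch of the memento pattern, not Aggregates.NET's actual signatures - the three hooks fit together roughly like this:

// Generic sketch of the three memento hooks; not the real Aggregates.NET API.
interface SnapshottingEntity<TMemento> {
    TMemento takeSnapshot();           // capture the current state as an opaque memento
    void restoreSnapshot(TMemento m);  // rebuild the state from a previously stored memento
    boolean shouldTakeSnapshot();      // policy hook, e.g. every 50 events as above
}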

When reading the entity stream, the first thing we do is check for a snapshot, then read the rest of the entity's stream from the moment the snapshot was taken. I.e., don't ask for the entire stream, just the part we have not snapshotted.

As for the store - you can use literally anything. VOU is right, though: a key-value store is best, because you only need to (1) check whether a snapshot exists and (2) load the entire thing, which is ideal for KV.

For system restarts - I'm not really following what your described problem is. There's no reason for your domain server to be stateful in the sense that it's doing something different at different points in time. It should do just one thing - process the next command. In the process of handling a command, it loads data from the event store, including a snapshot, and runs the command against the entity, which either produces a business exception or domain events that are recorded to the store.
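Sketched out (the names here are illustrative, not from any particular framework), that stateless loop is just:

// Sketch of the stateless command-handling loop described above.
void handle(Command command) {
    Entity entity = repository.load(command.entityId());    // snapshot (if any) + remaining events
    List<DomainEvent> produced = entity.execute(command);   // business rules; may throw a business exception
    eventStore.append(command.entityId(), produced);        // record the resulting domain events
}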

I think you may be trying to optimize too much with this talk of caching and cold starts.

Upvotes: 9
