Reputation: 33

Managing Entity Resolution in Anchor Modeling

I've been reading about anchor modeling and really like the concept. My hope is to possibly incorporate it into a data management framework where I consolidate multiple data sources into an anchor model, then either make it directly available or have it feed data marts for our data scientists.

But I'm not sure how to approach entity resolution. The guidelines state no updates, only inserts, with deletes reserved for removing erroneous data. Now let's say my source system(s) contain duplicate entities (e.g. John Smith appears more than once), and these make their way into my anchor model. What is the best way to clean this up?

My rubber duck is telling me to create an entity resolution layer on top of my anchor model that looks for these issues and corrects them. Correcting would mean merging entities in anchors and fixing the affected ties accordingly. But now I'm updating my anchor model, which is against best practices.

Or am I looking at this wrong, and entity resolution should be dealt with before data gets into the anchor model? But mistakes can happen, and it would be nice to know I could address the issue inside the anchor model should it present itself.

Upvotes: 0

Answers (1)

Dominik Kierner

Reputation: 13

Entity Resolution and Duplicates in Anchor Models

An anchor model can help you with deduplication and entity resolution, but you'll likely need at least some form of input preparation to load the data correctly into the database. It largely depends on which kind of duplicates you have and the temporalization model you use.

For simplicity, I will assume that you use a unitemporal anchor model.

Anchor Modeling Basics: Transitional Modeling Theory

To better understand the semantics of anchor models, I recommend reading Rönnbäck's paper "Modeling Conflicting, Unreliable, and Varying Information" (DOI: 10.13140/rg.2.2.34381.49121/1) on transitional modeling theory, which forms the basis of anchor modeling and its three temporalization/historization models: unitemporal (versioning), bitemporal (versioning and corrections) and concurrent-reliance temporal (multitemporality with reliances of statements). Rönnbäck's presentation "Posits and Assertions" is also good at explaining these details.

Page 14 of the paper covers the specific theory, and the definitions on pages 26 to 27 of the presentation explain it quite nicely.

Retraction is important for concurrent-reliance temporal models which use reliances. For all other cases, you just insert a new value. Though there are model features you can use.

Model Features for Entity Resolution and Deduplication

Model Setting: Equivalence

For data with equivalent meaning, you could use the equivalence setting in the menu "Generate", which, according to the modeler, is "useful for multitenancy and multilingualism". This probably comes closest to your problem.

Depending on how it's implemented, this might be a better solution than your batch key.

Attribute Settings

Restatements

For attributes, you can control restatements with the "Restatable" setting in the anchor modeler; turning it off prevents restatements from being stored. By way of explanation: a restatement occurs when two consecutive values over time are identical and only the assertion time differs.
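To make the idea of a restatement concrete, here is a minimal sketch in Python. It is not how the anchor modeler implements the setting (that happens in the generated SQL); the function name `split_restatements` and the `(changed_at, value)` row layout are assumptions made up for illustration.

```python
from datetime import date


def split_restatements(history):
    """Split (changed_at, value) rows into real changes and restatements.

    A row counts as a restatement when its value equals the value of the
    most recent real change before it. The row layout is hypothetical.
    """
    changes, restatements = [], []
    for changed_at, value in sorted(history, key=lambda row: row[0]):
        if changes and changes[-1][1] == value:
            restatements.append((changed_at, value))
        else:
            changes.append((changed_at, value))
    return changes, restatements


history = [
    (date(2020, 1, 1), "Smith"),
    (date(2020, 6, 1), "Smith"),  # same value as before: a restatement
    (date(2021, 3, 1), "Jones"),
    (date(2021, 9, 1), "Smith"),  # value changed back: a real change
]
changes, restatements = split_restatements(history)
```

With restatements disallowed, only the rows in `changes` would be kept for this attribute; the row in `restatements` carries no new information about the value.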

Idempotency

If you can ensure that your data arrives synchronously, you can also use the idempotency attribute setting to record values only when they change. For data that arrives asynchronously with respect to changing time, idempotency is not recommended.
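The following Python sketch illustrates one plausible reason why arrival order matters here. It is an assumption about the behaviour, not the modeler's actual trigger logic: the hypothetical `record_if_changed` skips a value when it equals the value already in effect at that changing time, and the integer timestamps are placeholders.

```python
from bisect import bisect_right


def record_if_changed(history, changed_at, value):
    """Insert (changed_at, value) into the sorted history unless the value
    in effect just before changed_at is already the same (hypothetical rule).
    Returns True if a row was stored."""
    times = [t for t, _ in history]
    pos = bisect_right(times, changed_at)
    if pos > 0 and history[pos - 1][1] == value:
        return False  # nothing changed, so nothing is recorded
    history.insert(pos, (changed_at, value))
    return True


events = [(1, "A"), (2, "A"), (3, "B")]

# Arrival order matches changing time: the repeated "A" is skipped,
# and replaying the same events a second time stores nothing new.
in_order = []
for t, v in events + events:
    record_if_changed(in_order, t, v)

# Asynchronous arrival: the earliest row shows up last, so the row at
# time 2 was stored before anyone knew it repeated an earlier value.
out_of_order = []
for t, v in [(2, "A"), (3, "B"), (1, "A")]:
    record_if_changed(out_of_order, t, v)
```

Under these assumptions, `in_order` ends up with two rows while `out_of_order` keeps all three, so the stored history depends on arrival order rather than only on the data itself.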

Updating Anchor Models

When updating anchor models, always use the modeler; otherwise you might end up changing the schema in a way that can no longer be updated correctly. Anchor models rely on a lot of trigger logic for sanity and consistency checks, which will be the first thing to break if you mess something up, so I wouldn't touch an anchor model with any manual operation that changes the schema.

In case you didn't know, you can update anchor models in place by applying the generated SQL to an existing anchor model database, because the generated SQL code is idempotent and will only update changed elements. Every previous schema is kept as a subset of the latest schema, which avoids costly schema maintenance and is one of the main reasons anchor modeling was developed in the first place.

Upvotes: 1