Starfish

Reputation: 707

How to generate ERD/UML for a database which has lost its primary key and foreign key information?

I have the table creation scripts of a well-structured database. But, for some reason, let's say the table creation scripts do not contain the primary key or foreign key information.

What is the best way to generate an ERD/UML for the database created using these scripts, assuming the indexes are intact? Is there any tool available that can check the column names and generate the ERD/UML?

Should I do this manually?

Upvotes: 3

Views: 101

Answers (2)

Walter Mitty

Reputation: 18940

The manual part of the task is basically business analysis from a data-centric point of view. This is time-consuming and tedious. Get as much help as you can from subject matter experts. Discover the business rules and the entities that the subject matter experts deal with all the time. Discover how they identify instances of these entities.

The automatable part of this task has been outlined by TAM in his answer.

Upvotes: 0

TAM

Reputation: 1741

The answer to your question - should I do this manually? - is yes, I'm afraid.

Let's think about the possibilities for automatic detection of primary keys and foreign keys. Whether such an algorithm already exists or you would have to write it yourself, it would have to determine, from a given database schema, which combinations of attributes qualify as PKs and FKs.

There is an easy case, but I suspect it does not apply to your database, otherwise you wouldn't be asking the question. If there were naming conventions (which are highly recommended), e.g. calling each PK just id and each FK <referenced table>_id, then it would be clear from the syntactic properties of the table definitions alone where the PKs and FKs are. But then your database definition would answer your question by itself.
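
If you want to check quickly whether such a convention is at least partly in place, a small catalog query can help. This is only a sketch, assuming a database that exposes information_schema (e.g. MySQL, PostgreSQL or SQL Server); the naming patterns are just the hypothetical conventions from above:

select table_name, column_name
from information_schema.columns
where column_name = 'id'                    -- convention: the PK is called id
   or column_name like '%!_id' escape '!'   -- convention: an FK is called <referenced table>_id
order by table_name, column_name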

Another way is via the semantics of the tables and their attributes. E.g., if you have a person table that contains a place_of_birth attribute, and only one table called city that could possibly contain such places, then most probably you have found an FK. But that kind of reasoning would be very hard to automate, as it would require a lot of knowledge of the domain underlying your database, and of the meaning and relationships of the terms of that domain.

The remaining algorithmic possibility is grounded on the formal definition of database keys. A key is an attribute or combination of attributes that has unique values for each record in the table. Unfortunately, this is an extensional definition, i.e. it refers to the database content you have at a specific point in time. Assume, e.g., that you have a table person with attributes first_name and last_name. You could find or write an algorithm that detects whether one or more columns qualify as key candidates, and it might today determine that first and last name qualify, as there are no two records in the table sharing the same value pair for them. But tomorrow, a second person called Jack Miller could come along and be inserted into your database alongside an existing Jack Miller, invalidating your PK assumption.
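
As a sketch of such an extensional check, sticking with the person example above (all names are only illustrative): the pair is a key candidate in today's data only if the two counts below are equal.

select
  (select count(*) from person) as total_rows,                              -- all records
  (select count(*)
     from (select distinct first_name, last_name from person) p) as distinct_name_pairs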

So a stable notion of key must be accompanied by an intensional argument: a domain law that ensures that a certain combination of attributes is, and will stay, unique for each domain object.

Often, this kind of difficulty is avoided by introducing an artificial key, like an id. While this is a recommendable practice, I'm afraid your database doesn't follow it consistently, otherwise you would have noticed it and your problem would at least partially be solved.
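
Just to illustrate what such a retrofit looks like, here is a sketch in PostgreSQL syntax, using the person table from above (the constraint name person_pk is only an example); other systems have equivalent constructs:

alter table person add column id bigint generated always as identity;  -- existing rows get values too
alter table person add constraint person_pk primary key (id);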

Coming to FKs, a similar difficulty arises. Syntactically, every combination of attributes in a table t1 whose value combinations are a subset of a combination of attributes in another table t2 might be an FK to that table. But even in the simple case where you have already determined primary keys, maybe for some semantically clear tables containing reference data, how should an algorithm tell, for a potential FK containing values from 1 to 100, which of the tables having keys from 1 to (at least) 100 is referenced?
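
To make that ambiguity concrete, here is a sketch with invented names: suppose orders.ref_id holds values from 1 to 100, and both city and country have id values covering that range. Both coverage checks below can come back as zero, and only domain knowledge can decide which table is really referenced.

select
  (select count(*) from orders o
    where not exists (select * from city c where c.id = o.ref_id)) as not_covered_by_city,
  (select count(*) from orders o
    where not exists (select * from country c2 where c2.id = o.ref_id)) as not_covered_by_country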

So I'm afraid the main task will be manual, or, better said, performed in the brain using some domain knowledge. Before applying bottom-up approaches that analyze ranges of value combinations, there should come some top-down reasoning. What are the main business objects of your domain, how could they be identified, and how do they need to be related to each other? Once you have found a hypothesis, the algorithmic verification is easy. To falsify a key candidate, perform a query like

-- key_candidate stands for the attribute (or attribute combination) under test
select key_candidate, count(*) as ct
from some_table
group by key_candidate
having count(*) > 1

and if it yields more than zero records, then key_candidate can't be a key. To falsify a foreign key, perform

-- fk_candidate stands for the suspected FK, referenced_key for the suspected
-- key in the referenced table
select *
from some_table t
where not exists (
   select *
   from referenced_table rt
   where t.fk_candidate = rt.referenced_key
)

and if it yields more than zero records, then fk_candidate is not an FK, or referenced_key is not the key you intended to reference. But those algorithmic arguments are only a supplement to your semantic studies.

Upvotes: 1
