Reputation: 153
For numerous reasons you'll want to know which column(s) make a row in a table unique. This is easy when a table has an actual PK or only a few columns.
However, in this case I have a lot of huge tables (50+ columns) without any data types specified and thus no keys. There are some options to get this information: parse the table with a specific tool, or simply ask the owner/creator of the table what the composite key is. Both can prove to be slow. I want to see if I can determine the composite key for a table with a stored procedure.
I know for sure there has to be one for each table, because every row in each table is unique.
I've spent some time on Google on this matter, but I only come across ways to create composite keys, or ways to list tables with composite keys (only if they are already defined).
Basically, what I'm trying to create is a stored procedure which takes a table name and returns the columns which I could turn into a composite key.
Can anyone point me in the right direction? I'm kind of lost at the moment. I'm using SQL Server 2014 in this project.
Upvotes: 0
Views: 343
Reputation: 131581
This is a common problem in ETL projects - you get a file, table or other data source and have to guess about unique columns, sizes, distinct values, relations between columns, nullability etc.
SSIS already has a Data Profiling Task that can read a source (e.g. a table) and perform many of these checks. The one you need is the Candidate Key Profile, which checks which column combinations can be used as keys or even approximate keys (i.e. not 100% unique).
This process takes a long time because the task has to read all the data and check all column combinations for uniqueness. The result is saved in a file that can be opened in the Data Profile Viewer for analysis.
You can use multiple profiles at once to save time, e.g. to collect size, distribution and candidate keys in one go.
You can also use multiple profiling tasks to analyze multiple sources/tables at once, or put the task in a loop container that profiles, e.g., all files in a folder or all tables in a list.
Upvotes: 1
Reputation: 1098
It's not the most appropriate task for a stored procedure, but it's doable given time. Unfortunately, a straightforward complete solution takes far too long: its time complexity is about O(2^C), where C is the number of columns, and each attempt is itself expensive.
The general idea of the complete solution is to iterate over column sets: every possible combination of table columns (order within a set doesn't matter). We check each set for uniqueness and take the one with the fewest columns among those that pass the test.
Good news: there is a quick test for whether a solution exists at all. Check that count(distinct) = count(*) over all columns. If it holds, you can at least use the whole table as one big key...
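A minimal sketch of that quick test, with a hypothetical table dbo.MyTable and made-up column names; SQL Server has no multi-column COUNT(DISTINCT), so the distinct count goes through a derived table:

    -- Is the whole row set (here: three hypothetical columns) unique?
    SELECT CASE WHEN
           (SELECT COUNT(*) FROM (SELECT DISTINCT Col1, Col2, Col3 FROM dbo.MyTable) AS d)
         = (SELECT COUNT(*) FROM dbo.MyTable)
           THEN 'unique' ELSE 'has duplicates' END AS RowSetStatus;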
Bad news: that doesn't solve the problem by itself.
But we can be greedy.
Count the unique values for each column and sort the columns in decreasing order of that count. If there is a simple 1-column PK, it will be among the first columns.
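As a sketch of this counting step (the table name dbo.MyTable is hypothetical): build one COUNT(DISTINCT ...) per column from sys.columns and run it as dynamic SQL, sorted by distinct count descending.

    DECLARE @sql nvarchar(max);
    SELECT @sql = STUFF((
            SELECT N' UNION ALL SELECT ''' + c.name
                 + N''' AS ColumnName, COUNT(DISTINCT ' + QUOTENAME(c.name)
                 + N') AS DistinctCnt FROM dbo.MyTable'
            FROM sys.columns c
            WHERE c.object_id = OBJECT_ID(N'dbo.MyTable')
            FOR XML PATH(''), TYPE).value('.', 'nvarchar(max)'),
        1, 11, N'') + N' ORDER BY DistinctCnt DESC;';
    -- Note: COUNT(DISTINCT) ignores NULLs, which matters for key candidates.
    EXEC sp_executesql @sql;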
Now take the columns in this order and add them to a candidate set one by one. If the product of the unique-value counts for the selected columns 1..K is less than count(*), you can't get a unique index out of them (e.g. columns with 10 and 50 distinct values can produce at most 500 distinct combinations), so go further.
Once the product is large enough, check count(distinct) for a select over these columns with dynamic SQL. If it's still not unique, go further.
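A sketch of that check, assuming @cols already holds a validated, comma-separated candidate list for the hypothetical dbo.MyTable; a GROUP BY ... HAVING COUNT(*) > 1 probe is equivalent to comparing count(distinct) with count(*):

    DECLARE @cols nvarchar(max) = N'Col1, Col2, Col3';  -- hypothetical candidate set
    DECLARE @sql nvarchar(max), @dupes int;
    SET @sql = N'SELECT @dupes = COUNT(*) FROM ('
             + N' SELECT ' + @cols + N' FROM dbo.MyTable'
             + N' GROUP BY ' + @cols + N' HAVING COUNT(*) > 1) AS d;';
    EXEC sp_executesql @sql, N'@dupes int OUTPUT', @dupes = @dupes OUTPUT;
    -- @dupes = 0 means this candidate set is unique over the current data.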
Once we get enough columns to cover uniqueness, we are nearly done. Now let's try to remove some columns from this set: iterate I from 1 to K and try the column set (1..I-1, I+1..K). If it's still unique, remove column I from the set and continue dropping columns.
It's a bit tricky to continue the iteration properly after removing a column, but it's doable.
Once we have tried to remove every column of the covering set, we are done.
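A minimal sketch of the removal pass, again with hypothetical names; each column is dropped in turn and stays out if the remaining set is still unique:

    DECLARE @candidate TABLE (pos int IDENTITY(1,1), col sysname);
    INSERT INTO @candidate (col) VALUES (N'Col1'), (N'Col2'), (N'Col3'), (N'Col4');

    DECLARE @i int = 1,
            @n int = (SELECT COUNT(*) FROM @candidate),
            @cols nvarchar(max), @sql nvarchar(max), @dupes int;
    WHILE @i <= @n
    BEGIN
        -- Build the candidate list without column @i (NULL if nothing would remain).
        SET @cols = STUFF((SELECT N',' + QUOTENAME(col) FROM @candidate
                           WHERE pos <> @i FOR XML PATH('')), 1, 1, N'');
        IF @cols IS NOT NULL
        BEGIN
            SET @sql = N'SELECT @dupes = COUNT(*) FROM (SELECT ' + @cols
                     + N' FROM dbo.MyTable GROUP BY ' + @cols
                     + N' HAVING COUNT(*) > 1) AS d;';
            EXEC sp_executesql @sql, N'@dupes int OUTPUT', @dupes = @dupes OUTPUT;
            IF @dupes = 0
                DELETE FROM @candidate WHERE pos = @i;  -- column was redundant
        END;
        SET @i += 1;
    END;

    SELECT col FROM @candidate ORDER BY pos;  -- a locally minimal covering set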
This is not an exact algorithm, so you'll have to check the result manually, but at least you'll have something to start with, and it has O(C*D) time complexity, where C is the number of columns and D is the amount of data in the table.
Upvotes: 2
Reputation: 130
I feel like you just need an algorithm for it. If you do not have a PK, you could use the columns that appear in indexes as key suspects. You have to make sure your composite key is unique table-wide, so you have to query the data. Here is what I thought (a sketch follows the list):

0. Query the row count of the table.
1. Choose a column that is suspected to be a unique key.
2. Make a distinct-count query of that column, for example over the top 1000 rows.
3. If the result is 1000, the data is unique so far. In this case repeat the query with a greater number of rows, until you reach the row count of the table. If the result of the last query equals the row count of the table, you have found your probably-unique key.
4. If the result is lower than 1000, extend the chosen column with another suspect, making a composite key, and repeat from step 2.
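A minimal sketch of steps 2-3 with hypothetical names; note that TOP without ORDER BY returns an arbitrary sample, which is fine for a heuristic:

    DECLARE @n int = 1000;  -- sample size; grow it towards the table's row count
    SELECT COUNT(DISTINCT s.SuspectCol) AS DistinctCnt,
           COUNT(*)                     AS SampledRows
    FROM (SELECT TOP (@n) SuspectCol FROM dbo.MyTable) AS s;
    -- DistinctCnt = SampledRows: still unique, retry with a larger @n.
    -- DistinctCnt < SampledRows: add another suspect column and test the combination.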
This can be slow and tedious work, and you should consider that even with this you cannot be sure you have found a true unique key, only a key that is unique over the already existing data.
Upvotes: 0