Reputation: 860
(with SQL Server 2008) I have a big table (~50M records) that is fully normalized. There are 4 primary columns, and one of the them has only three possible entries- A, B, and C. The issue is, often there is much redundancy for this column. That is to say, there can be many records with value A, and then many repeated records that are identical in all respects, except with value B (and/or C). This redundancy does not always happen, but it's frequent enough that it greatly increases the record count and I wish to be rid of it.
My idea is that instead of A, B, C being choices for a column, I've thought about creating 3 bit columns titled A, B, C. Then in the case of the aforementioned redundancies for these values, I don't have to create repeated records, but instead just have one record and then flag the A, B, and/or C columnns as necessary.
These seems unorthodox so I thought I'd see what the experts think. One thing is that there would be three different uniqueness contraints for this table, each including all the other primary keys plus one of the three flag columns.
[EDIT] To clarify on the meaning of "many repeated records", one of the other PK's is a date column. So for example, there could be 1000 records of different dates with entry A, and then another 1000 records of the same dates (and other columns identical) but with entry B. So that is how even with only three choices there can still be lots of redundancy.
Upvotes: 0
Views: 844
Reputation: 432271
You can't have "many repeated records that are identical in all respects" except for the 4th column in the PK that takes one of A OR B or C. This means to me that you have at most 3 rows (over the other 3 PK columns) differentiated by either A or B or C
This means you should have one unique constraint because of this.
I'd do nothing based on this and also
Edit:
Your redundancy isn't in the ABC column. The row multiplication is caused by the datetime.
Can you change the datetime to smalldatetime and suppress near-duplicates that way? eg resolve to nearest minute not 3.33 milliseconds? Or for SQL Server 2008 use datetime2 and pick your resolution
Upvotes: 1
Reputation: 37655
Most databases will allocate a minimimum of the most efficient processing unit per field in any case so calling them bit fields would only be a metadata difference. But unpacking bits into words is just overhead anyway. You might as well use probably ints. And I'm pretty sure Sql Server doesn't index bit fields - cardinality of 2 doesn't help much.
50M records? A small number by most accounts.
Have you tried to quantify the overhead you're trying to reduce? If nothing else you're going to add work for the increased complexity.
I'd have to think a long time before increasing complexity.
Is this a really stable design otherwise, and you have some extra time?
Upvotes: 0
Reputation: 65157
The only issue I can think of off the top of my head is that you will need to change your existing code and include all 3 fields if you want to get any use out of indexing on those bit columns.
Bit fields by their nature are not very selective. To get good selectivity you will need to create a covering index on all 3 fields, and then include all 3 in your WHERE
clauses so you get optimum seeks.
Upvotes: 0
Reputation: 13367
How about creating a separate table that stores these "flags", foreign key'd back to your original table?
Table1 (original table)
----------------------
PriKey1 (PK for Table1)
Col1
Col2
Table2 (new table)
------------------
PriKey2 (PK for Table2)
PriKey1 (FK to Table1)
A
B
C
Upvotes: 0
Reputation: 776
I personally wouldn't do it that way, I would create another table that would store either the A, B, or C
and the RecordID.
Upvotes: 0