hackartist
hackartist

Reputation: 5264

SQL Schema Design Advice

I have a 'users' table which has a bunch of concrete "sure" properties about my users all of which must be there and their veracity is certain and then I have a separate table 'users_derived' where all data in this table is derived properties of my users guessed by machine learning models. For example: 'age' might be a certain property since they supplied it to me, 'height' or 'hair color' might be a derived property since an ML model guessed it from a picture. The main difference is all properties in the 'users' table were given to me by the user themselves and have complete certainty whereas all properties in the 'user_derived' table have both the value and a certainty associated with it and were guessed at by my system. The other difference is all properties of the 'users' table will be there for every user, while any property in the 'users_derived' table may or may not be there. From time to time I add new ML models which guess at more properties of users as well.

My question is how to do the schema for the 'users_derived' table. I could do it like this:

userid  |  prop1  | certainty1  |  prop2  | certainty2 | prop3 |  etc ...
123         7         0.57         5'8''       0.82       red
124         12        0.6          NULL        NULL       black
125         NULL      NULL         6'1''       0.88       blonde

or I could do it like this with slightly different indexing:

userid   |  property  |  value   |   certainty 
 123           1           7            0.57
 123           2          5'8''         0.82
 124           1           12           0.60
 123           3          red           0.67
 124           3          black         0.61
 125           2          6'1''         0.88
                       etc ....

So the tradeoffs seem like in the second way it isn't as normalized and might be slightly harder to query but you don't have to know all the properties you care about in advance -- that is if I want to add a new property there is no schema change. Also there don't have to be any NULL spots since if we don't have that property yet we just don't have a row for it. What am I missing? What are the benefits of the first way? Are there queries I can do against the first schema that are hard or impossible in the second schema? Does the second way somehow need more space for indexing to make it fast?

Upvotes: 0

Views: 54

Answers (1)

rd_nielsen
rd_nielsen

Reputation: 2459

The second way is more normalized. Both the table and the indexes are likely to be more compact, especially if the first form is relatively sparsely populated. Although the two forms have different tradeoffs for different queries, in general the second form is more flexible and better suited to a wide variety of queries. If you want to transform data from the normalized form to the crosstabbed form, there is a crosstab function in Postgres' tablefunc extension that can be used for this purpose. Normalizing crosstabbed data will be more difficult, especially if the number of columns is indeterminate--yet you may need to do that for some types of queries.

Upvotes: 2

Related Questions